jonny

More fun publisher surveillance:
Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*; this is a diff between the metadata from two copies of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDFs.

[A list of metadata for a PDF, the important fields being two "Unknown:<long random character string>" fields that are color-coded to indicate that they have been changed between versions.]
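If you have two copies of the same paper, a comparison like the one in the screenshot can be reproduced with exiftool and diff (copy1.pdf and copy2.pdf are placeholder names; this is just a sketch, not necessarily the tool used for the image):

diff <(exiftool -a -G1 -s copy1.pdf) <(exiftool -a -G1 -s copy2.pdf)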
серафими многоꙮчитїи

@jonny they look kind of meaningful. Not base64. Any ideas what could be in there?

jonny

@derwinmcgeary
yeah, I thought so too but don't know where to start reverse engineering it :/

jonny

@derwinmcgeary
it decodes with base85, but it's not Unicode. not sure if that's meaningful

Old Tom

@jonny I do not have any IT skills, but if I did I’d love to write a script to remove metadata from PDFs. Adobe has them wrapped up pretty well.

jonny

You can see for yourself using exiftool.
To remove all of the top-level metadata, you can use exiftool and qpdf:

exiftool -all:all= <path.pdf> -o <output1.pdf>
qpdf --linearize <output1.pdf> <output2.pdf>

To remove *all* metadata, you can use dangerzone or mat2
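For example, with mat2 on the command line (a rough sketch; paper.pdf is a placeholder, and by default mat2 writes a cleaned copy next to the original):

mat2 --show paper.pdf     # list the metadata mat2 can detect
mat2 paper.pdf            # writes paper.cleaned.pdf with metadata removed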

jonny

Also present in the metadata are NISO tags for document status indicating the "final published version" (VoR), and limits on what domains it should be present on. Elsevier scans for PDFs with this metadata, so it's a good idea to strip it any time you're sharing a copy.

[Quoted Elsevier policy text:]

Internet takedown programs

Elsevier partners with a technology vendor to continuously search the Internet for unauthorized posting of our book and journal content. In accordance with the Digital Millennium Copyright Act (DMCA), we issue “takedown” notices to the operators of websites hosting such unauthorized content. To complement this automated searching, Elsevier maintains online tools for staff to report an infringed work. Our partner then works to expedite reporting, investigation, and removal of specific infringing content. If you discover, or learn about pirated content online, don’t hesitate to let your contact at Elsevier know about it; he or she can use our internal systems to make sure the problem is quickly addressed.
jonny

Links:
exiftool: exiftool.org/
qpdf: qpdf.sourceforge.io/
dangerzone (GUI, render PDF as images, then re-OCR everything): dangerzone.rocks/
mat2 (render PDF as images, don't OCR): 0xacab.org/jvoisin/mat2

jonny

here's a shell script that recursively removes metadata from pdfs in a provided (or current) directory, as described above. It's for mac/*nix-like computers, and you need to have qpdf and exiftool installed:
gist.github.com/sneakers-the-r

[Screenshot of code at URL in tweet, the script first uses "find" to locate the files, and passes them to a while loop. It creates a clean PDF at a temporary file, removing it if one exists already. Code follows]


#!/bin/bash

# Color Codes so that warnings/errors stick out
GREEN="\e[32m"
RED="\e[31m"
CLEAR="\e[0m"

# loop through all PDFs in first argument ($1),
# or use '.' (this directory) if not given
DIR="${1:-.}"

echo "Cleaning PDFs in directory $DIR"

# use find to locate files, pipe to while read to get the
# whole line instead of space delimited
# Note -- this will find pdfs recursively!!
find "$DIR" -type f -name "*.pdf" | while read -r i
do

  # output file as original filename with suffix _clean.pdf
  TMP=${i%.*}_clean.pdf

  # remove the temporary file if it already exists
  if [ -f "$TMP" ]; then
      rm "$TMP";
  fi

  # strip the top-level metadata, then relinearize the clean copy in place
  exiftool -q -q -all:all= "$i" -o "$TMP"
  qpdf --linearize --replace-input "$TMP"
  echo -e "${GREEN}Processed ${RED}${i} ${CLEAR}as ${GREEN}${TMP}${CLEAR}"
done
jonny

The metadata appears to be preserved on papers from sci-hub. Since it works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured.
twitter.com/json_dirs/status/1

jonny

for any security researchers out there, here are a few more "hashes"; several people have noted that they do not appear to be random and might be decodable. exiftool apparently squashed the whitespace, so there is a bit more structure to them than in the OP:
gist.github.com/sneakers-the-r

jonny

this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.
twitter.com/horsemankukka/stat
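The actual command is in the linked post; purely as an illustration, one way to pull the raw XMP packet out of a PDF without exiftool's whitespace normalization (assuming the metadata stream isn't compressed; paper.pdf is a placeholder, and plain grep works on Linux) would be:

ggrep -Pazo '(?s)<\?xpacket begin.*?<\?xpacket end[^>]*\?>' paper.pdf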

jonny

of course there's smarter watermarking; the metadata is notable because you could scan billions of pdfs fast. this comment on HN got me thinking about a PDF /OpenAction I couldn't make sense of earlier: on open, it accesses the metadata and does something with sizes and layout...

[top comment on HN thread]

So just take pics of the pages and convert the pics back to a PDF
	
[first sub-comment]

A motivated publisher could embed codes by altering in subtle ways the differences in distances or color between adjacent characters, so that they would survive most color or grey scale conversions; a seemingly innocuous frame drawn around a photo could be either larger or smaller by say one millimeter, representing de facto a bit, therefore using enough pages they could identify a book among billions. Unfortunately there's no way to be 100% sure that a complex document doesn't contain some form of embedded code.


	
[second sub-comment]

Easier to just strip out the metadata

I don't really know what I'm looking at so I can't really describe it. There's a top part that says "Suspicious elements: /OpenAction" and then when I list its properties there is an access to the metadata, some changes to a crop box, etc.
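One way to poke at an /OpenAction yourself (a sketch, not the tool shown in the screenshot; paper.pdf is a placeholder) is to convert the file to qpdf's uncompressed QDF form and grep around the action:

qpdf --qdf --object-streams=disable paper.pdf paper_qdf.pdf
grep -an -A5 '/OpenAction' paper_qdf.pdf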
jonny replied to jonny

updated the above gist with correctly extracted tags, and included python code to extract your own; feel free to add them in the comments. since we don't know what they contain yet, I'm not adding other metadata. definitely patterned, not a hash, but idk yet.
twitter.com/json_dirs/status/1

jonny replied to jonny

you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?

jonny replied to jonny

follow-up: there does not appear to be any further watermarking: taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that are identical according to diff, i.e. no differentiating watermark (but plz check my work)
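Roughly, that check looks like this (copy1.pdf and copy2.pdf are placeholders; a sketch of the workflow described above, not the exact commands used):

for f in copy1.pdf copy2.pdf; do
  exiftool -q -q -all:all= "$f" -o "${f%.pdf}_clean.pdf"
  qpdf --linearize --deterministic-id "${f%.pdf}_clean.pdf" "${f%.pdf}_final.pdf"
done
cmp copy1_final.pdf copy2_final.pdf && echo "no byte-level differences"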

jonny replied to jonny

which is surprising to me, so I'm a little hesitant to make that a general claim

Nick Astley replied to jonny

@jonny

It's a couple things:

a) Elsevier's vendor's tool only has to be good enough to impress Elsevier

b) Deterrence being more efficient than prevention

shusha replied to jonny

@jonny for the normativity of science see the discourse of STS (science and technology studies), great field!

jonny replied to shusha

@shusha
yes definitely, love it and spend basically all my time reading it nowadays ❤️

robryk

@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e. would we still have enough people do that). (If we did so, then it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they would use still works.)

jonny

@robryk
I think it may be easier to scrub it server side, like to have admins clean the PDFs they have. I don't know of any crowdsourced sci-hub-like projects. scrubbing metadata does seem to render the PDFs identical

robryk

@jonny And then obviously the watermarking techniques will adapt. Asking for two copies is a way to ensure that whatever we are doing still manages to scrub the watermark (they should be identical after scrubbing).

jonny

@robryk
yes, definitely. all of the above. fix what you have now, adapt to changes, making double grabs part of the protocol makes sense :)

Advanced Persistent Teapot

@jonny hmmm, alternatively start inserting copies of that metadata into blank or template PDFs. Send 'em chasing wild geese and make them look at Lorem Ipsum 50 times a day 😈

Thai Thien

@http_error_418 @jonny Very well.
Do you have code to insert those codes? I would like to help.
Btw, you can use mathgen thatsmathematics.com/mathgen/ to make meaningless papers to upload to scribd

:garfield:‍fuchsiaaaaaaaaaaaaaaaaa

@jonny a word of caution: while removing exif is good, knowing publishers, there's a bunch of other ways they'd directly include such trackers in the file, in a less human/machine-readable spot than EXIF. so be careful

Stewart Russell

@jonny they're almost getting to the level of ISO standards for metadata f'wittery.

For a while, many ISO standards that you bought (for $$$$) looked like a bad photocopy. If you zoomed in really close to the marks on the page, they were made up of a pattern of punctuation characters. Totally screwed up any screen reading, though

Orca🌻 | 🏴🏳️‍⚧️

@jonny@social.coop seems like some countermeasures against scihub, libgen and other shadow libraries that provide those PDFs for free 🤨

KawaiiPunk

@jonny this is the same technique that was being used in the OS designed in North Korea called Red Star OS. It was in the Chaos Congress talk about it.

Gord

@jonny they really are a right bunch of bastards, aren’t they?

🦇Lyle Solla-Yates🦇

@jonny this makes me think some horrible things are going to happen to students because of this but I can’t quickly think of an example

Carl Mathias Kobel 🧬🦠🧫👩‍💻

@jonny But what meaningful data can they attach to that unique ID? The IP address? Assume a user is not logged in, has cleared tracking cookies and is using a VPN.
Wait a sec. That is why we need open access.

jonny

@cmkobel
Browser fingerprinting is pretty robust.
amiunique.org/fingerprint
And even those mitigations won't be taken by 99.999% of visitors.

OddOpinions5

@jonny

note: I am a tech illiterate

I went to the exiftool website and it's clearly not something I would find easy to use

so I googled
tool to remove pdf metadata

and it seems like there are lots of nice, easy-to-use programs - google chrome has one built in?

eg

my fav software seems to have a metadata removal/sanitize tool
pdf-xchange.com/search?query=r

and also

tools.pdf24.org/en/remove-pdf-

pdfgear.com/pdf-editor-reader/

Sertonix

@jonny uBlock is able to modify the response of http requests. Maybe somebody can create a filter to strip the hash from the file.

Serge from Babka

@jonny

This is for sure embedded watermarking but I'm not seeing active surveillance in this (unlike, say, Kindle or the myriad of phone-home technologies).

While I would strongly prefer Free Culture media where there is permission granted for distribution, modification, attribution mandates, and redistribution, if the media is DRM-free but does not permit widespread distribution, what mechanisms do you see as appropriate for the copyright holder to use to identify unauthorized distributed copies?

Gerard Ritsema van Eck

@serge @jonny
They could easily use this to track who is uploading papers to shadow libraries and slap them or their institutions with huge lawsuits.

For-profit publishing has no place in modern #academia.

Serge from Babka

@Gerard

I agree with you that for-profit publishing and academia are not good bedfellows, and I generally think copyright is broken in more ways than one. The issue for me is given the current legal environment, this seemed a simple way to address unauthorized duplication that doesn't involve DRM or other surveillance tech.

@jonny

http :verified:

@jonny If this "hash" is the only difference, then it can easily be removed or replaced, as it seems to be related to a user id. The question is if there are further differences, but looking at your later posts it seems that no.

Gilgwath

@jonny Aaah yes, DMCA against random citizens, enforced by the same govs who originally funded the majority of the papers in the first place. So, the taxpayer gets to pay for the education, the research, the access fee AND the cops and legal system who have to go after the right-minded citizens who publish the stuff they already paid for at least twice. This is fine. 🔥 The market will regulate itself.

Rue Mohr

@jonny Why else would they require a free account and login to download it...

Jacques Chester

@jonny @SwiftOnSecurity those look sequential rather than fully random hashes
