Links:
exiftool: https://www.exiftool.org/
qpdf: https://qpdf.sourceforge.io/
dangerzone (GUI, render PDF as images, then re-OCR everything): https://dangerzone.rocks/
mat2 (render PDF as images, don't OCR): https://0xacab.org/jvoisin/mat2

26 Jan 2022 at 0:28 | Open on social.coop

18 comments

jonny

here's a shell script that recursively removes metadata from pdfs in a provided (or current) directory as described above. It's for mac/*nix-like computers, and you need to have qpdf and exiftool installed:
https://gist.github.com/sneakers-the-rat/172e8679b824a3871decd262ed3f59c6

26 Jan 2022 at 2:38 | Open on social.coop
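
For readers who would rather not run a shell script, the recursive scrub described in the post above could be sketched in Python roughly as follows. This is not the linked gist. It is a minimal sketch that assumes the two-step approach of blanking tags with exiftool and then rewriting the file with qpdf. The names `find_pdfs`, `build_commands`, `scrub_tree` and the `scrubbed` output directory are hypothetical, and both tools must already be on your PATH.

```python
import subprocess
import tempfile
from pathlib import Path


def find_pdfs(root="."):
    """Return every .pdf file under root (recursively), sorted for stable output."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def build_commands(src, stripped, cleaned):
    """Build the two commands (assumed workflow, not taken from the gist)."""
    return [
        # Step 1: exiftool blanks all writable tags and writes a new file.
        ["exiftool", "-all:all=", str(src), "-o", str(stripped)],
        # Step 2: qpdf rewrites the file from scratch in linearized form.
        ["qpdf", "--linearize", str(stripped), str(cleaned)],
    ]


def scrub_tree(root=".", out_dir="scrubbed"):
    """Write a cleaned copy of each PDF under root into out_dir; originals are kept."""
    root, out_dir = Path(root), Path(out_dir)
    for src in find_pdfs(root):
        if out_dir.resolve() in src.resolve().parents:
            continue  # skip output left over from an earlier run
        cleaned = out_dir / src.relative_to(root)
        cleaned.parent.mkdir(parents=True, exist_ok=True)
        with tempfile.TemporaryDirectory() as tmp:
            stripped = Path(tmp) / src.name
            for cmd in build_commands(src, stripped, cleaned):
                # check=True raises if a tool exits non-zero; qpdf also exits
                # non-zero when it only emits warnings.
                subprocess.run(cmd, check=True)
```

Calling `scrub_tree("papers")` would leave the originals alone and put cleaned copies under `scrubbed/`. The qpdf step matters because exiftool's PDF edits are appended to the file as an update rather than rewriting it, so on their own they can be reversed.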

jonny

The metadata appears to be preserved on papers from sci-hub. since it works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured
https://twitter.com/json_dirs/status/1486135162505072641?t=Wg5XAzujycz79Cop_ap8vQ&s=19

26 Jan 2022 at 4:08 | Open on social.coop

jonny

for any security researchers out there, here are a few more "hashes" that a few have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace so there is a bit more structure to them than in the OP:
https://gist.github.com/sneakers-the-rat/6d158eb4c8836880cf03191cb5419c8f

26 Jan 2022 at 5:02 | Open on social.coop

jonny

https://twitter.com/kmagnacca/status/1486209676979032064?t=GT8fV5QG-4SGTkLadYpCNQ&s=19

26 Jan 2022 at 5:32 | Open on social.coop

jonny

https://twitter.com/SchmiegSophie/status/1486206774159970305?t=GT8fV5QG-4SGTkLadYpCNQ&s=19

26 Jan 2022 at 5:35 | Open on social.coop

jonny

this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.
https://twitter.com/horsemankukka/status/1486268962119761924?s=20

26 Jan 2022 at 9:32 | Open on social.coop

jonny

of course there's smarter watermarking, the metadata is notable because you could scan billions of pdfs fast. this comment on HN got me thinking about this PDF /OpenAction I couldn't make sense of earlier, on open, access metadata, so something with sizes and layout...

26 Jan 2022 at 9:47 | Open on social.coop

jonny replied to jonny

updated the above gist with correctly extracted tags, and included python code to extract your own, feel free to add them in the comments. since we don't know what they contain yet not adding other metadata. definitely patterned, not a hash, but idk yet.
https://twitter.com/json_dirs/status/1486289288115359747?t=QwmBvbOgh2fCkjSOZSh3Fw&s=19

26 Jan 2022 at 11:00 | Open on social.coop

jonny replied to jonny

you go to school to study "the brain" and then the next thing you know you're learning how to debug surveillance in PDF rendering to understand how publishers have so contorted the practice of science for profit. how can there be "normal science" when this is normal?

26 Jan 2022 at 11:34 | Open on social.coop

jonny replied to jonny

follow-up: there does not appear to be any further watermarking: taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that diff reports as identical, i.e. no differentiating watermark (but plz check my work)

27 Jan 2022 at 3:15 | Open on social.coop
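
Anyone who wants to check that result could do it along these lines. This is a rough sketch of the comparison described in the post above, not the author's actual procedure. The helper names `normalize_command` and `byte_identical` are made up for illustration, and it assumes the metadata has already been stripped from both files before the qpdf step.

```python
import filecmp


def normalize_command(src, dst):
    """qpdf call that relinearizes and derives the file ID from the content
    instead of from the current time, so two runs are comparable."""
    return ["qpdf", "--linearize", "--deterministic-id", str(src), str(dst)]


def byte_identical(path_a, path_b):
    """True only if both files contain exactly the same bytes.

    shallow=False forces a full content comparison rather than trusting
    file size and modification time.
    """
    return filecmp.cmp(path_a, path_b, shallow=False)
```

Run the command from `normalize_command` on each stripped copy (for example with `subprocess.run(..., check=True)`), then pass the two outputs to `byte_identical`. A `True` result means no per-download difference survived in those two files. It says nothing about other publishers or later downloads.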

jonny replied to jonny

which is surprising to me, so I'm a little hesitant to make that as a general claim

27 Jan 2022 at 3:18 | Open on social.coop

Nick Astley replied to jonny

@jonny

It's a couple things:

a) Elsevier's vendor's tool only has to be good enough to impress Elsevier

b) Deterrence being more efficient than prevention

10 June at 12:30 | Open on urbanists.social

shusha replied to jonny

@jonny for the normativity of science see the discourse of STS (science and technology studies), great field!

27 Jan 2022 at 8:39 | Open on post.lurk.org

jonny replied to shusha

@shusha
yes definitely, love it and spend basically all my time reading it nowadays ❤️

27 Jan 2022 at 9:02 | Open on social.coop

robryk

@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e. would we still have enough people do that). (If we did so, then it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they would use still works.)

27 Jan 2022 at 1:33 | Open on qoto.org

jonny

@robryk
I think it may be easier to scrub it server side, like to have admins clean the PDFs they have. I don't know of any crowdsourced sci-hub-like projects. scrubbing metadata does seem to render the PDFs identical

27 Jan 2022 at 5:39 | Open on social.coop

robryk

@jonny And then obviously the watermarking techniques will adapt. Asking for two copies is a way to ensure that whatever we are doing still manages to scrub the watermark (they should be identical after scrubbing).

27 Jan 2022 at 8:57 | Open on qoto.org
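
The double-grab idea in the post above could be expressed as a small check like the one below. It is only a sketch: `scrub_still_works` and `strip_tag_line` are hypothetical names, and the toy scrubber works on a made-up `TAG:` line format rather than on real PDF metadata. In practice `scrub` would be whatever cleaning step an archive actually runs.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def scrub_still_works(copy_a: bytes, copy_b: bytes, scrub) -> bool:
    """Two independently downloaded copies of the same paper should become
    byte-identical once scrubbed; if they do not, some mark survived."""
    return digest(scrub(copy_a)) == digest(scrub(copy_b))


def strip_tag_line(data: bytes) -> bytes:
    """Toy scrubber for illustration only: drops lines starting with b'TAG:'."""
    return b"\n".join(
        line for line in data.split(b"\n") if not line.startswith(b"TAG:")
    )
```

With two toy copies that differ only in their `TAG:` line, `scrub_still_works(a, b, strip_tag_line)` is true, while passing a do-nothing scrubber makes it false. The check can only catch marks that differ between the two copies, so the two grabs would need to come from different accounts for it to mean much.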

jonny

@robryk
yes, definitely. all of the above. fix what you have now, adapt to changes, making double grabs part of the protocol makes sense :)

27 Jan 2022 at 9:02 | Open on social.coop
