Links:
exiftool: https://www.exiftool.org/
qpdf: https://qpdf.sourceforge.io/
dangerzone (GUI, render PDF as images, then re-OCR everything): https://dangerzone.rocks/
mat2 (render PDF as images, don't OCR): https://0xacab.org/jvoisin/mat2
The metadata appears to be preserved on papers from sci-hub. Since sci-hub works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed or secured.

For any security researchers out there, here are a few more "hashes" that a few people have noted do not appear to be random and might be decodable. exiftool apparently squashed the whitespace, so there is a bit more structure to them than in the OP:

This is the way to get the correct tags:

Of course there's smarter watermarking. The metadata is notable because you could scan billions of PDFs fast.

This comment on HN got me thinking about this PDF /OpenAction I couldn't make sense of earlier: on open, access metadata, so something with sizes and layout...

Updated the above gist with correctly extracted tags, and included Python code to extract your own. Feel free to add them in the comments. Since we don't know what they contain yet, I'm not adding other metadata. Definitely patterned, not a hash, but idk yet.

Follow-up: there does not appear to be any further watermarking. Taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs that diff reports as identical, i.e. no differentiating watermark (but plz check my work).

It's a couple of things: a) Elsevier's vendor's tool only has to be good enough to impress Elsevier, and b) deterrence is more efficient than prevention.
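
For anyone who wants to repeat the follow-up check (strip, relinearize, compare), here is a rough Python sketch. It is not the code from the gist. The function names are made up for illustration, and the exiftool flags (`-q`, `-overwrite_original`, `-all:all=`) and qpdf's `--linearize` are my assumptions about a reasonable way to do the stripping. Only `--deterministic-id` is named in the thread. It assumes both tools are on your PATH.

```python
import hashlib
import shutil
import subprocess
from pathlib import Path


def cleaning_commands(work: Path, dst: Path) -> list[list[str]]:
    """Commands that strip `work` in place, then rewrite it to `dst`."""
    return [
        # Blank every writable tag in every group on the scratch copy.
        ["exiftool", "-q", "-overwrite_original", "-all:all=", str(work)],
        # Rewrite the file with an /ID derived from its content,
        # so two runs on equal input give equal output.
        ["qpdf", "--linearize", "--deterministic-id", str(work), str(dst)],
    ]


def clean_copy(src: Path, dst: Path) -> None:
    """Write a stripped, relinearized copy of src to dst; src is untouched."""
    work = dst.with_name(dst.stem + ".work.pdf")
    shutil.copyfile(src, work)
    try:
        for cmd in cleaning_commands(work, dst):
            result = subprocess.run(cmd)
            # qpdf uses exit code 3 for "warnings only"; accept it for qpdf.
            ok = (0, 3) if cmd[0] == "qpdf" else (0,)
            if result.returncode not in ok:
                raise RuntimeError(f"{cmd[0]} exited with {result.returncode}")
    finally:
        work.unlink(missing_ok=True)


def same_bytes(a: Path, b: Path) -> bool:
    """True when both files have the same SHA-256 digest."""
    def digest(p: Path) -> str:
        return hashlib.sha256(p.read_bytes()).hexdigest()
    return digest(a) == digest(b)
```

With two downloads of the same article (say `one.pdf` and `two.pdf`, hypothetical names), you would call `clean_copy` on each and then `same_bytes` on the two outputs. A `True` result would match what was reported above. A `False` result would mean something beyond the metadata differs.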
Here's a shell script that recursively removes metadata from PDFs in a provided (or the current) directory, as described above. It's for Mac/*nix-like computers, and you need to have qpdf and exiftool installed:
https://gist.github.com/sneakers-the-rat/172e8679b824a3871decd262ed3f59c6
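
If you'd rather not run a shell script, the same idea can be sketched in Python. This is my own approximation of "walk a directory, strip each PDF, relinearize it", not a port of the gist. `find_pdfs`, `clean_in_place`, and `clean_tree` are hypothetical names, and the exiftool and qpdf flags are assumptions apart from `--deterministic-id`. It overwrites files, so try it on copies first.

```python
import subprocess
from pathlib import Path
from typing import Callable


def find_pdfs(root: Path) -> list[Path]:
    """All files under root with a .pdf suffix, matched case-insensitively."""
    return sorted(
        p for p in root.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )


def clean_in_place(pdf: Path) -> None:
    """Strip metadata from one PDF and rewrite it over the original."""
    subprocess.run(
        ["exiftool", "-q", "-overwrite_original", "-all:all=", str(pdf)],
        check=True,
    )
    tmp = pdf.with_name(pdf.stem + ".relinearized.pdf")
    result = subprocess.run(
        ["qpdf", "--linearize", "--deterministic-id", str(pdf), str(tmp)]
    )
    # qpdf exit code 3 means warnings only; the output is still written.
    if result.returncode not in (0, 3):
        tmp.unlink(missing_ok=True)
        raise RuntimeError(f"qpdf failed on {pdf}")
    tmp.replace(pdf)


def clean_tree(root: Path, clean: Callable[[Path], None] = clean_in_place) -> int:
    """Apply `clean` to every PDF under root and return how many were handled."""
    count = 0
    for pdf in find_pdfs(root):
        clean(pdf)
        count += 1
    return count
```

Calling `clean_tree(Path.cwd())` would process the current directory. Passing something harmless as `clean`, for example `print`, gives a dry run that only lists the files it would touch.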