this is the way to get the correct tags:
(on mac i needed to install gnu grep with homebrew `brew install grep` and then use `ggrep` )
will follow up with dataset tomorrow.
https://twitter.com/horsemankukka/status/1486268962119761924?s=20
Top-level
this is the way to get the correct tags: 8 comments
updated the above gist with correctly extracted tags, and included python code to extract your own, feel free to add them in the comments. since we don't know what they contain yet not adding other metadata. definitely patterned, not a hash, but idk yet. follow-up: there does not appear to be any further watermarking: taking two files with different identifying tags, stripping metadata, and relinearizing with qpdf's --deterministic-id flag yields PDFs identical with a diff, ie. no differentiating watermark (but plz check my work) It's a couple things: a) Elsevier's vendor's tool only has to be good enough to impress Elsevier b) Deterrence being more efficient than prevention |
of course there's smarter watermarking, the metadata is notable because you could scan billions of pdfs fast. this comment on HN got me thinking about this PDF /OpenAction I couldn't make sense of earlier, on open, access metadata, so something with sizes and layout...