OK, I _swear_ I didn’t do this on purpose but PDF sucks...

OK, I _swear_ I didn’t do this on purpose but PDF sucks so bad as a publishing format that the easiest way to build search traffic to a website turns out to be republishing information that’s otherwise locked up in a PDF

16 September at 14:56 | Open on fedi.simonwillison.net

8 comments

Simon Willison

Heck, I did it with a paper that had been published by Google’s own research team and my version was in the top three results on Google within 2 hours of putting it online https://simonwillison.net/2024/Aug/24/pipe-syntax-in-sql/

Don’t let friends publish useful information in PDFs!

16 September at 14:58 | Open on fedi.simonwillison.net

Bill

@simon LaTeX 4 lyfe!

16 September at 15:05 | Open on infosec.exchange

DELETED

@simon
folks are not going to enjoy me saying this, but MongoDB uses a pipeline syntax, where the results of the last step can be fed to the next step.

When I realised this, it was harder to dislike Mongo.

16 September at 16:02 | Open on phpc.social
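
For anyone who hasn't seen the "results of the last step feed the next step" idea in action, here is a rough sketch in plain Python. It is not MongoDB's API: `match`, `sort_by`, `project` and `run_pipeline` are hypothetical helpers that loosely mimic the `$match`, `$sort` and `$project` aggregation stages, and the `orders` data is made up.

```python
from functools import reduce

# Each helper returns a "stage": a function that takes a list of rows
# and returns a new list of rows. All names here are hypothetical.

def match(predicate):
    """Keep only rows for which predicate(row) is true."""
    return lambda rows: [r for r in rows if predicate(r)]

def sort_by(field, descending=False):
    """Order rows by one field."""
    return lambda rows: sorted(rows, key=lambda r: r[field], reverse=descending)

def project(*fields):
    """Keep only the named fields of each row."""
    return lambda rows: [{f: r[f] for f in fields} for r in rows]

def run_pipeline(rows, stages):
    """Feed the output of each stage into the next one, in order."""
    return reduce(lambda acc, stage: stage(acc), stages, rows)

# Made-up sample data.
orders = [
    {"customer": "ann", "total": 30, "status": "paid"},
    {"customer": "bob", "total": 75, "status": "paid"},
    {"customer": "cy", "total": 50, "status": "refunded"},
]

result = run_pipeline(orders, [
    match(lambda r: r["status"] == "paid"),
    sort_by("total", descending=True),
    project("customer", "total"),
])
```

Each stage only needs to know about the rows it receives, which is why stages can be reordered, added or dropped without rewriting the rest of the query.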

Daniel

@simon I really really hope the SEO grifters are not on here reading this 🙈

16 September at 15:10 | Open on chaos.social

Jeremiah Lee

@simon Most of my PDF creation is solving the “how do I save the current content of a webpage as a single file and not just an image” problem.

SingleFile <https://github.com/gildas-lormeau/SingleFile> and/or MHTML <https://en.wikipedia.org/wiki/MHTML> should be more universal than they are.

16 September at 15:22 | Open on alpaca.gold

Mark T. Tomczak

@simon I actually wonder if it's a technical challenge issue.

There are so many ways to create a PDF that's readable to humans and illegible to computers that it's much, much easier to make something search-engine-friendly in HTML format. Even in the case of Google where, I imagine, they can OCR that shit, that pipeline's gotta be more of a bottleneck than interpreting token-stripped HTML because it just costs more resources to transform images of text.

And that's before we factor in the human element: Google still gets signal on popularity from clicks, and if I see a PDF in the wild, my default response is "No thank you; I do not want this information in likely-unsearchable page-by-page form that'll be harder to consume than a plain web page."

16 September at 16:59 | Open on mastodon.fixermark.com
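
To give a sense of how little work tag-stripped HTML takes compared with OCR, here is a minimal sketch using only Python's standard-library `html.parser`. The `TextExtractor` class, the `html_to_text` helper and the sample markup are all made up for illustration, and this says nothing about how any search engine actually processes pages.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    """Return the text content of an HTML string, joined by spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Made-up sample page.
sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Pipe syntax</h1>"
    "<p>FROM orders |&gt; WHERE total &gt; 10</p></body></html>"
)
text = html_to_text(sample)
```

The same text locked inside a scanned PDF would need rendering and character recognition before it could be indexed at all.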

Simon Willison

@mark I'm sure that's what's going on here - HTML is a far better format for machine-readability than PDF

16 September at 17:13 | Open on fedi.simonwillison.net

Francis 🏴‍☠️ Gulotta

@simon or from an old website with an invalid SSL certificate

17 September at 4:23 | Open on toot.cafe