Simon Willison

OK, I _swear_ I didn’t do this on purpose but PDF sucks so bad as a publishing format that the easiest way to build search traffic to a website turns out to be republishing information that’s otherwise locked up in a PDF

8 comments
Simon Willison

Heck, I did it with a paper that had been published by Google’s own research team and my version was in the top three results on Google within 2 hours of putting it online simonwillison.net/2024/Aug/24/

Don’t let friends publish useful information in PDFs!

Lewis Cowles

@simon
folks are not going to enjoy me saying this, but MongoDB uses a pipeline syntax, where the results of the last step can be fed to the next step.

When I realised this, it was harder to dislike Mongo.

Daniel

@simon I really really hope the SEO grifters are not on here reading this 🙈

Jeremiah Lee

@simon Most of my PDF creation is solving the “how do I save the current content of a webpage as a single file and not just an image” problem.

SingleFile <github.com/gildas-lormeau/Sing> and/or MHTML <en.wikipedia.org/wiki/MHTML> should be more universal than they are.

Mark T. Tomczak

@simon I actually wonder if it's a technical challenge issue.

There are so many ways to create a PDF that's readable to humans and illegible to computers that it's much, much easier to make something search-engine-friendly in HTML format. Even in the case of Google where, I imagine, they can OCR that shit, that pipeline's gotta be more of a bottleneck than interpreting token-stripped HTML because it just costs more resources to transform images of text.

And that's before we factor in the human element: Google still gets signal on popularity from clicks, and if I see a PDF in the wild, my default response is "No thank you; I do not want this information in likely-unsearchable page-by-page form that'll be harder to consume than a plain web page."

Simon Willison

@mark I'm sure that's what's going on here - HTML is a far better format for machine-readability than PDF
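
To make the machine-readability point concrete, here is a minimal sketch of pulling plain text out of HTML using only Python's standard library. The names `TextExtractor` and `html_to_text` and the sample markup are made up for illustration; this is not how Google or anyone in the thread actually processes pages, just a hint at how little work the HTML case needs compared with OCR on a scanned PDF.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script and style content.

    Hypothetical helper for illustration only.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Assumed sample document, not taken from the thread.
sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
)
print(html_to_text(sample))
```

The text is already in the markup, so a few dozen lines of parsing recover it. A PDF made from page images has no equivalent shortcut.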
