Mark T. Tomczak

@simon I actually wonder if it's a technical challenge issue.

There are so many ways to create a PDF that's readable to humans and illegible to computers that it's much, much easier to make something search-engine-friendly in HTML format. Even in the case of Google where, I imagine, they can OCR that shit, that pipeline's gotta be more of a bottleneck than interpreting tag-stripped HTML, because it just costs more resources to turn images of text back into text.

And that's before we factor in the human element: Google still gets signal on popularity from clicks, and if I see a PDF in the wild, my default response is "No thank you; I do not want this information in likely-unsearchable page-by-page form that'll be harder to consume than a plain web page."

Simon Willison

@mark I'm sure that's what's going on here - HTML is a far better format for machine-readability than PDF.
