Simon Willison

I built a new tool: tools.simonwillison.net/ocr - it runs OCR against images and PDFs entirely in your browser (no file upload needed) using Tesseract.js and PDF.js

I wrote more about the tool and how I built it (with copious amounts of Claude 3 Opus and a little bit of ChatGPT) here: simonwillison.net/2024/Mar/30/

18 comments
Molly White

@simon nice! i was thinking of trying to do something similar to autogenerate alt text, which i currently tend to do by opening images in chrome and using google lens (far too many clicks)

Simon Willison

@molly0xfff Yes! I first used something like this for the alt text in my annotated presentation tool here: til.simonwillison.net/tools/an

aaron schaffer

@simon Very cool. Though I get a Heroku error when I try to go to your site ("Application error: An error occurred in the application and your page could not be served. If you are the application owner, check your logs for details. You can do this from the Heroku CLI with the command heroku logs --tail")

Simon Willison

@aaronjschaffer Huh... it looks like it's the Mastodon effect, where sending out a link causes thousands of Mastodon servers to all hit /.well-known/webfinger?resource=acct:simon@simonwillison.net at the same time - but I've survived these storms just fine in the past, not sure why it's hurting the site today

Nelson Minar 🧚‍♂️

@simon under 8MB! 3MB each for Tesseract WASM and the training data.

Prem Kumar Aparanji 👶🤖🐘

@simon need more such browser-only, offline-first, privacy-first apps that don't require any installation or configuration!

Stuart Gray

@prem_k @simon If you didn't see it at the time, this was quite a cool offline browser-based transcription tool posted a few weeks back:

bne.social/@simon/112057608292

Like you, I love these kinds of tools but if I could *beg* the authors for one feature - please make it easy to download the needed files so I can run it all truly offline :)

Simon Willison

@StuartGray @prem_k That really is a worthwhile feature for this one, I've opened an issue - no promises I'll solve it though, there are things in there relating to bundling that I don't know how to do yet github.com/simonw/tools/issues

Simon Willison

Something I really like about this tool is that the entire thing is 226 lines of combined HTML, CSS and JavaScript (plus the PDF.js and Tesseract.js dependencies, loaded from a CDN)

The code is a little untidy but at 226 lines it honestly doesn't matter github.com/simonw/tools/blob/9

Simon Willison

Also neat is that the enabling libraries here - Tesseract.js and PDF.js - are both pretty old at this point:

First commit to Tesseract.js was Jun 26, 2015 github.com/naptha/tesseract.js

First commit to PDF.js was Apr 25, 2011 github.com/mozilla/pdf.js/comm

Julia

@simon Tesseract (the non-JS version) was originally created by HP in the 1980s and open-sourced in 2005.

Simon Willison

My other OCR project from yesterday: textract-cli, the thinnest possible CLI wrapper around AWS's Textract API, built out of frustration at how hard that is to use!

github.com/simonw/textract-cli

It only works with JPEGs and PNGs up to 5MB in size, reflecting limitations in Textract’s synchronous API - anything more than that has to go to S3 first.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
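
A rough sketch of how you might pre-check a file against the limits mentioned above (JPEG or PNG, up to 5MB) before handing it to the tool. This is not code from textract-cli itself: the function name `fits_sync_limits` is hypothetical, reading "5MB" as 5 * 1024 * 1024 bytes is an assumption, and it only looks at the file extension rather than the actual image contents.

```python
from pathlib import Path

# Assumption: "5MB" is taken to mean 5 * 1024 * 1024 bytes.
MAX_SYNC_BYTES = 5 * 1024 * 1024

# Assumption: file type is judged by extension only, not by inspecting the bytes.
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def fits_sync_limits(filename: str, size_bytes: int) -> bool:
    """Return True if the file looks like a JPEG/PNG and is within the size limit.

    Hypothetical helper for illustration; not part of textract-cli.
    """
    suffix = Path(filename).suffix.lower()
    return suffix in ALLOWED_SUFFIXES and size_bytes <= MAX_SYNC_BYTES


if __name__ == "__main__":
    import sys

    # Check each path given on the command line and report the result.
    for arg in sys.argv[1:]:
        path = Path(arg)
        ok = fits_sync_limits(path.name, path.stat().st_size)
        print(f"{arg}: {'ok' if ok else 'too large or wrong type, would need S3'}")
```

Anything that fails this check would need the S3 route instead, which this sketch does not attempt.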

gabi

@simon Insanely cool! It works fine in Android Chrome (no luck with Firefox though).

ResearchBuzz

@simon Damn, that's awesome! Queued for ResearchBuzz.
