Simon Willison

Something I really like about this tool is that the entire thing is 226 lines of combined HTML, CSS and JavaScript (plus the PDF.js and Tesseract.js dependencies, loaded from a CDN).

The code is a little untidy, but at 226 lines it honestly doesn't matter: github.com/simonw/tools/blob/9

Simon Willison

Also neat is that the enabling libraries here - Tesseract.js and PDF.js - are both pretty old at this point:

First commit to Tesseract.js was Jun 26, 2015 github.com/naptha/tesseract.js

First commit to PDF.js was Apr 25, 2011 github.com/mozilla/pdf.js/comm

Julia

@simon Tesseract (the non-JS version) was originally created by HP in the 1980s and open-sourced in 2005.

Simon Willison

My other OCR project from yesterday: textract-cli, the thinnest possible CLI wrapper around AWS's Textract API, built out of frustration at how hard that is to use!

github.com/simonw/textract-cli

It only works with JPEGs and PNGs up to 5MB in size, reflecting limitations in Textract’s synchronous API - anything more than that has to go to S3 first.

Assuming you’ve configured AWS credentials already, this is all you need to know:

pipx install textract-cli
textract-cli image.jpeg > output.txt
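
For anyone curious what a thin wrapper like this might involve, here is a minimal Python sketch. It is not the actual textract-cli source. It assumes boto3's Textract client and its `detect_document_text` call, and the helper names (`check_image`, `extract_lines`, `ocr_file`) are made up for illustration. It checks the JPEG/PNG and 5MB constraints mentioned above before sending the bytes:

```python
from pathlib import Path

# Limits taken from the post above: JPEG/PNG only, up to 5MB for the synchronous API.
MAX_BYTES = 5 * 1024 * 1024
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}


def check_image(path, size):
    """Raise ValueError if the file does not fit the synchronous API's limits."""
    if Path(path).suffix.lower() not in ALLOWED_SUFFIXES:
        raise ValueError(f"{path}: only JPEG and PNG files are supported")
    if size > MAX_BYTES:
        raise ValueError(f"{path}: {size} bytes is over the 5MB limit")


def extract_lines(client, image_bytes):
    """Return the text of every LINE block in a DetectDocumentText response."""
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


def ocr_file(client, path):
    """Read an image from disk, validate it, and return its text, one line per row."""
    data = Path(path).read_bytes()
    check_image(path, len(data))
    return "\n".join(extract_lines(client, data))
```

With boto3 installed and AWS credentials configured, something like `print(ocr_file(boto3.client("textract"), "image.jpeg"))` would print the recognized text. Passing the client in as a parameter keeps the helpers easy to exercise with a stub.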
