Published some notes on Docling, a rather nice MIT licensed Python PDF document / table extraction library from IBM https://simonwillison.net/2024/Nov/3/docling/
@matt I tried it on two documents so far and it looked reasonable, but I've not done a remotely robust comparison of it yet
@xsc I tried it on two PDFs and it looked OK, which isn't nearly enough testing for me to say anything useful!
@simon How does the Markdown output from Docling compare with the HTML that you've gotten out of Gemini for PDF documents? Does Docling do a good job of recognizing headings, lists, etc.?