Simon Willison

Video scraping: extracting JSON data from a 35 second screen capture for less than 1/10th of a cent simonwillison.net/2024/Oct/17/

I needed to extract information from a dozen emails in my inbox... so I ran a screen capture tool, clicked through each of them in turn and then got Google's Gemini 1.5 Flash multi-modal LLM to extract (correct, I checked it) JSON data from that 35 second video.
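
(A minimal sketch of that kind of call via the google-generativeai Python SDK - the filename and prompt here are hypothetical, not the exact ones from the post:)

import time
import google.generativeai as genai

genai.configure(api_key="...")  # Gemini API key

# Upload the screen capture; video files take a moment to process
video = genai.upload_file("capture.mp4")  # hypothetical filename
while video.state.name == "PROCESSING":
    time.sleep(2)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content([
    video,
    "Extract the sender, subject and date of each email as a JSON array.",  # hypothetical prompt
])
print(response.text)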

Total cost for 11,018 tokens: $0.00082635
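
(That figure works out exactly from Gemini 1.5 Flash's $0.075 per million input tokens:)

input_tokens = 11_018
price_per_million_usd = 0.075  # Gemini 1.5 Flash input rate, per the calculator below
cost = input_tokens / 1_000_000 * price_per_million_usd
print(f"${cost:.8f}")  # $0.00082635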

Simon Willison

Bonus from that post: I got fed up with calculating token prices by hand, so I had Claude Artifacts spin up this pricing calculator tool with presets for all of the major models tools.simonwillison.net/llm-pr

Screenshot of LLM Pricing Calculator interface. Left panel: input fields for tokens and costs. Input Tokens: 11018, Output Tokens: empty, Cost per Million Input Tokens: $0.075, Cost per Million Output Tokens: $0.3. Total Cost calculated: $0.000826 or 0.0826 cents. Right panel: Presets for various models including Gemini, Claude, and GPT versions with their respective input/output costs per 1M tokens. Footer: Prices were correct as of 16th October 2024, they may have changed.
Simon Willison

Here's another example of multi-modal vision LLM usage: I collected the prices for the different preset models by dumping screenshots of their pricing pages directly into the Claude conversation

Full transcript here: gist.github.com/simonw/6b684b5

Claude: Is there anything else you'd like me to adjust or explain about this updated calculator? Me: Add a onkeyup event too, I want that calculator to update as I type. Also add a section underneath the calculator called Presets which lets the user click a model to populate the cost per million fields with that model's prices - which should be shown on the page too. I've dumped in some screenshots of pricing pages you can use - ignore prompt caching prices. There are five attached screenshots of pricing pages for different models.
Kyle Hughes

@simon Leaks show that the ChatGPT Mac and/or web app are going to get screen sharing soon via the Realtime API. Seems like this is the next frontier: dumping the whole personal computing experience into models.

Mark Eichin

@simon I'm a little confused by the OCR part - is that just some unrelated (but obviously useful) service tacked on the front, or is there some way LLMs are involved in the character recognition itself? (15 years ago OCR quality was related to text modelling; there was some interest in using our geotagger to provide feedback for OCR of map labels, but I haven't dug into that space in a while)

Peter Hoffmann

@simon Do you use openrouter.ai to connect to different models, or do you use each service with its own API and cost tracking?

Simon Willison

@hoffmann I mostly use the service APIs directly - I have an OpenRouter account too but I like to stay deeply familiar with all of the different APIs as part of developing my llm.datasette.io tool

Drew Breunig

@simon Nice! You should drop a tokenizer in there for people.

Simon Willison

@dbreunig I'm still frustrated that Anthropic don't release their tokenizer!

Gemini have an API endpoint for counting tokens but I think it needs an API key
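
(For example, a minimal sketch with the google-generativeai Python SDK - it does need a key configured:)

import google.generativeai as genai

genai.configure(api_key="...")  # the count endpoint requires an API key
model = genai.GenerativeModel("gemini-1.5-flash")
print(model.count_tokens("The quick brown fox jumps over the lazy dog.").total_tokens)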

Drew Breunig

@simon Now that you mention it, I'm curious how different each platform is with tokens and how that might affect pricing (or just be a wash)

Simon Willison

@dbreunig yeah it's frustratingly difficult to compare tokenizers, which sure makes price per million less directly comparable

Simon Willison

@dbreunig running a benchmark that processes a long essay and records the input token count for different models could be interesting though
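
(A sketch of that benchmark, assuming tiktoken for the OpenAI side and the Gemini count_tokens call from above - Anthropic is the gap, since their tokenizer isn't public:)

import tiktoken
import google.generativeai as genai

essay = open("essay.txt").read()  # hypothetical long essay

# OpenAI models: tokenize locally with tiktoken
for m in ("gpt-4o", "gpt-4"):
    print(m, len(tiktoken.encoding_for_model(m).encode(essay)))

# Gemini: count via the API
genai.configure(api_key="...")
print("gemini-1.5-flash",
      genai.GenerativeModel("gemini-1.5-flash").count_tokens(essay).total_tokens)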

Phil Gyford

@simon Is it also possible to calculate how much energy these things use, and some comparisons of what that's equivalent to? I hear that AI is energy intensive but I have zero concept of what that means in reality for a single "thing" like this.

Simon Willison

@philgyford if that's possible I haven't seen anyone do it yet - the industry don't seem to want to talk specifics

GPUs apparently draw a lot more power when they are actively computing than when they are idle, so there's an energy cost associated with running a prompt that wouldn't exist if the hardware was turned on but not doing anything
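
(A purely illustrative back-of-envelope, with made-up numbers: if a prompt kept a GPU drawing an extra 300 W above idle for 2 seconds, that's 300 W x 2 s / 3600 = roughly 0.17 Wh - but the real per-prompt figures are exactly what isn't published.)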

jacoBOOian 👻

@simon the fact that this works as well as it did kinda blows my mind — what an absolutely _wild_ pattern for data scraping.

Frederik Elwert

@simon So you basically re-implemented Recall? 😉

th0ma5

@simon great documentation ... Any details on accuracy? How much did you have to clean up the output and did you have to check it all by hand?

Simon Willison

@th0ma5 I checked it all, didn't take long (I watched the 35s video and scanned the JSON) - it was exactly correct

Felix Westphal

@simon @th0ma5 you being surprised that this actually worked says a lot about the state we're in, and as far as I know, with this technology we can never be sure that the result will actually be correct. So if you have to double-check anyway, you could just do it yourself manually (or use another, non-LLM tool).

Simon Willison

@superFelix5000 @th0ma5 right - the single hardest thing about learning to productively work with LLMs is figuring out how to get useful results out of inherently unreliable technology

th0ma5

@simon @superFelix5000 Sorry, in my previous reply I didn't realize that you had written the code yourself. The manual setup and such that you did is the hard part, right? Extracting frames and OCRing them can be done with just command line tools. I think your documentation is great, but it doesn't feel like a net gain to me.
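
(For comparison, that command-line pipeline could be as simple as ffmpeg plus Tesseract - a sketch driving both from Python, with hypothetical filenames:)

import glob
import subprocess
import pytesseract
from PIL import Image

# Pull one frame per second out of the screen capture
subprocess.run(
    ["ffmpeg", "-i", "capture.mp4", "-vf", "fps=1", "frame_%03d.png"],
    check=True,
)

# OCR each frame with Tesseract
for path in sorted(glob.glob("frame_*.png")):
    print(path, pytesseract.image_to_string(Image.open(path)))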
