Simon Willison

Here's another example of multi-modal vision LLM usage: I collected the prices for the different preset models by dumping screenshots of their pricing pages directly into the Claude conversation.

Full transcript here: gist.github.com/simonw/6b684b5
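
For anyone who wants to try the same screenshot-dumping trick outside the Claude web UI, here is a minimal sketch using the Anthropic TypeScript SDK. The model id, file name, and prompt text are illustrative assumptions, not taken from the transcript above.

    // Sketch: sending a pricing-page screenshot to Claude as a base64 image
    // content block. Assumes `npm install @anthropic-ai/sdk` and an
    // ANTHROPIC_API_KEY in the environment; names here are hypothetical.
    import Anthropic from "@anthropic-ai/sdk";
    import { readFileSync } from "node:fs";

    const client = new Anthropic();

    async function extractPrices(screenshotPath: string): Promise<string> {
      // Images travel alongside the text prompt as base64-encoded blocks.
      const data = readFileSync(screenshotPath).toString("base64");
      const message = await client.messages.create({
        model: "claude-3-5-sonnet-latest", // illustrative model id
        max_tokens: 1024,
        messages: [
          {
            role: "user",
            content: [
              { type: "image", source: { type: "base64", media_type: "image/png", data } },
              {
                type: "text",
                text: "List the per-million-token input and output prices in this screenshot. Ignore prompt caching prices.",
              },
            ],
          },
        ],
      });
      // The reply comes back as content blocks; take the first text block.
      const block = message.content[0];
      return block.type === "text" ? block.text : "";
    }

    extractPrices("pricing-page.png").then(console.log);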

Claude: Is there anything else you'd like me to adjust or explain about this updated calculator?

Me: Add an onkeyup event too; I want that calculator to update as I type. Also add a section underneath the calculator called Presets which lets the user click a model to populate the cost per million fields with that model's prices - which should be shown on the page too. I've dumped in some screenshots of pricing pages you can use - ignore prompt caching prices.

(Five screenshots of pricing pages for different models were attached.)
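
The actual generated code lives in the gist linked above; as a rough sketch of the two requested behaviors, the keyup wiring and Presets section could look something like this. The element IDs, model names, and prices are made up for illustration.

    // Sketch: live recalculation on keyup plus clickable presets.
    // IDs and preset prices are hypothetical, not copied from the gist.
    const PRESETS: Record<string, { input: number; output: number }> = {
      "Model A": { input: 3.0, output: 15.0 }, // $/million tokens (made up)
      "Model B": { input: 0.25, output: 1.25 },
    };

    const $ = (id: string) => document.getElementById(id) as HTMLInputElement;

    function updateTotal(): void {
      // total = (tokens / 1e6) * cost-per-million, input and output summed
      const total =
        (Number($("input-tokens").value) / 1e6) * Number($("input-cost").value) +
        (Number($("output-tokens").value) / 1e6) * Number($("output-cost").value);
      document.getElementById("total")!.textContent = `$${total.toFixed(4)}`;
    }

    // Update as the user types: the "onkeyup event" from the prompt.
    for (const id of ["input-tokens", "output-tokens", "input-cost", "output-cost"]) {
      $(id).addEventListener("keyup", updateTotal);
    }

    // Presets section: one button per model; clicking fills the cost fields,
    // and the button label shows the prices on the page.
    const presets = document.getElementById("presets")!;
    for (const [name, p] of Object.entries(PRESETS)) {
      const btn = document.createElement("button");
      btn.textContent = `${name}: $${p.input} in / $${p.output} out per M tokens`;
      btn.addEventListener("click", () => {
        $("input-cost").value = String(p.input);
        $("output-cost").value = String(p.output);
        updateTotal();
      });
      presets.appendChild(btn);
    }
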
Kyle Hughes

@simon Leaks show that the ChatGPT Mac and/or web app are going to get screen sharing soon via the Realtime API. Seems like this is the next frontier: dumping the whole personal computing experience into models.

Mark Eichin

@simon I'm a little confused by the OCR part - is that just some unrelated (but obviously useful) service tacked onto the front, or is there some way LLMs are involved in the character recognition itself? (15 years ago OCR quality was tied to text modelling; there was some interest in using our geotagger to provide feedback for OCR of map labels, but I haven't dug into that space in a while.)
