Simon Willison

I added multi-modal (image, audio, video) support to my LLM command-line tool and Python library, so now you can use it to run all sorts of content through LLMs such as GPT-4o, Claude and Google Gemini

Cost to transcribe 7 minutes of audio with Gemini 1.5 Flash 8B? 1/10th of a cent.

simonwillison.net/2024/Oct/29/

But let’s do something a bit more interesting. I shared a 7m40s MP3 of a NotebookLM podcast a few weeks ago. Let’s use Flash-8B—the cheapest Gemini model—to try and obtain a transcript.

llm 'transcript' \
  -a https://static.simonwillison.net/static/2024/video-scraping-pelicans.mp3 \
  -m gemini-1.5-flash-8b-latest

It worked!

    Hey everyone, welcome back. You ever find yourself wading through mountains of data, trying to pluck out the juicy bits? It’s like hunting for a single shrimp in a whole kelp forest, am I right? Oh, tell me about it. I swear, sometimes I feel like I’m gonna go cross-eyed from staring at spreadsheets all day. [...]

Once again, llm logs -c --json will show us the tokens used. Here it’s 14754 prompt tokens and 1865 completion tokens. The pricing calculator says that adds up to... 0.0833 cents. Less than a tenth of a cent to transcribe a 7m40s audio clip.
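To double-check that figure, here is the arithmetic as a short Python sketch. The per-million-token prices are assumptions based on Google's published Gemini 1.5 Flash-8B rates at the time of writing; they change, so verify against the current price list:

```python
# Back-of-the-envelope check of the "0.0833 cents" figure.
# Assumed prices in USD per 1M tokens (Gemini 1.5 Flash-8B, <=128k context):
INPUT_PRICE_PER_M = 0.0375   # prompt tokens
OUTPUT_PRICE_PER_M = 0.15    # completion tokens

prompt_tokens = 14754
completion_tokens = 1865

cost_usd = (prompt_tokens * INPUT_PRICE_PER_M
            + completion_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"{cost_usd * 100:.4f} cents")  # → 0.0833 cents
```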
20 comments
Ame

@simon Since you have the tokens readily available what do you think about including the pricing calculator directly inside `llm`?
aider is doing a similar thing by directly showing you how much you were billed for each response.

Simon Willison

@ame I want to do a bit more with token accounting - maybe store them in separate database columns - but I'm not so keen on doing the price calculations in the tool because I can't promise correct results: prices change often, sometimes without any warning

I'd be OK outsourcing that to a plugin though

Andrea Borruso

@simon first of all thank you very much.

In your llm logs output you have "total_tokens". I don't have that; instead I often see "totalTokenCount": 3522

gist.github.com/aborruso/6004d

Am I doing something wrong?

Simon Willison

@aborruso that output is different for different models - I have a future plan to normalize those and store them separately in the database
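The normalization Simon describes could look something like the sketch below. The alias table is an assumption for illustration (OpenAI-style snake_case keys vs Gemini-style camelCase), not LLM's actual implementation:

```python
# Different models report token usage under different JSON keys.
# This maps known aliases onto canonical snake_case names.
KEY_ALIASES = {
    "prompt_tokens": ["prompt_tokens", "promptTokenCount"],
    "completion_tokens": ["completion_tokens", "candidatesTokenCount"],
    "total_tokens": ["total_tokens", "totalTokenCount"],
}

def normalize_usage(usage):
    """Return token counts from a raw usage dict under canonical keys."""
    normalized = {}
    for canonical, aliases in KEY_ALIASES.items():
        for key in aliases:
            if key in usage:
                normalized[canonical] = usage[key]
                break
    return normalized

print(normalize_usage({"promptTokenCount": 3000, "totalTokenCount": 3522}))
# → {'prompt_tokens': 3000, 'total_tokens': 3522}
```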

Andrea Borruso

@simon Sorry, another stupid question: in my case, do I have to add up all the totalTokenCount values?

Thank you
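If you do need to add them up yourself, a stdlib-only sketch follows. It assumes (hypothetically) that each entry in the `llm logs --json` output exposes the raw Gemini response under a `response_json` key with a `usageMetadata` dict; adjust the paths to match your actual log shape:

```python
import json

def sum_total_tokens(log_text):
    """Sum totalTokenCount across a JSON array of log entries."""
    total = 0
    for entry in json.loads(log_text):
        usage = entry.get("response_json", {}).get("usageMetadata", {})
        total += usage.get("totalTokenCount", 0)
    return total

# Two fake log entries in the assumed shape:
sample = json.dumps([
    {"response_json": {"usageMetadata": {"totalTokenCount": 3522}}},
    {"response_json": {"usageMetadata": {"totalTokenCount": 1200}}},
])
print(sum_total_tokens(sample))  # → 4722
```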

Simon Willison

If you are still LLM-skeptical but haven't spent much time thinking about or experimenting with these multi-modal variants I'd encourage you to take a look at them

Being able to extract information from images, audio and video is a truly amazing capability, and something which was previously prohibitively difficult - see XKCD 1425 xkcd.com/1425/

Matthew Martin

@simon Half a century later, it is a solved problem.

Daniel

@simon Note that some of us are skeptical for reasons such as the exploitation of creative folks, the copyright infringements at scale, the hype cycle created by venture capital, the impact it has on misinformation and the ads space, and so on. Some of the tech is cool no doubt.

Simon Willison

@djh those are all very valid reasons to be skeptical!

The only reason I'll consistently push back at is the idea that these things aren't useful at all

Andrei Zmievski

@simon Something I've been meaning to ask.. is there a decent guide to which models are best suited for which tasks? As in, "gemini models are better for extracting content from video/audio, etc", including model versions, sizes, etc.

Xing Shi Cai

@simon Does video work? I tried both Gemini pro and flash, but I only got some error message. Do I need a paid account to use video scraping? (Image works as expected.)

Simon Willison

@xsc video should work, what file format were you trying? Currently files need to be less than 20MB - that's a temporary limitation of my llm-gemini plugin

Xing Shi Cai

@simon I was using a 5 MB MP4. The error just says "internal error". I downloaded the video from here pexels.com/video/catching-and-

Simon Willison

@xsc I've seen a few of those "Internal error" messages too - I think it's Gemini being a little bit flaky, sometimes resubmitting works fine the second time

Xing Shi Cai

@simon I was using the following command

> llm 'please explain what is happening in the video' -a man-in-water.mp4 -m gemini-1.5-flash-latest

Does it look like it should work?

Simon Willison

@xsc yes, if you have the llm-gemini plugin installed and configured with an API key

You could try using this script here (or using Google's AI Studio tool) to check it's not an LLM bug: til.simonwillison.net/llms/pro
