Simon Willison

I added multi-modal (image, audio, video) support to my LLM command-line tool and Python library, so now you can use it to run all sorts of content through LLMs such as GPT-4o, Claude and Google Gemini

Cost to transcribe 7 minutes of audio with Gemini 1.5 Flash 8B? 1/10th of a cent.

simonwillison.net/2024/Oct/29/

But let’s do something a bit more interesting. I shared a 7m40s MP3 of a NotebookLM podcast a few weeks ago. Let’s use Flash-8B—the cheapest Gemini model—to try and obtain a transcript.

llm 'transcript' \
  -a https://static.simonwillison.net/static/2024/video-scraping-pelicans.mp3 \
  -m gemini-1.5-flash-8b-latest

It worked!

    Hey everyone, welcome back. You ever find yourself wading through mountains of data, trying to pluck out the juicy bits? It’s like hunting for a single shrimp in a whole kelp forest, am I right? Oh, tell me about it. I swear, sometimes I feel like I’m gonna go cross-eyed from staring at spreadsheets all day. [...]

Once again, llm logs -c --json will show us the tokens used. Here it’s 14754 prompt tokens and 1865 completion tokens. The pricing calculator says that adds up to... 0.0833 cents. Less than a tenth of a cent to transcribe a 7m40s audio clip.
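To double-check that figure, here is the arithmetic as a short Python sketch. The per-million-token prices are assumptions based on Google's published Gemini 1.5 Flash-8B rates at the time of writing; they change, so verify against the current price list:

```python
# Back-of-the-envelope check of the "0.0833 cents" figure.
# Assumed prices in USD per 1M tokens (Gemini 1.5 Flash-8B, <=128k context):
INPUT_PRICE_PER_M = 0.0375   # prompt tokens
OUTPUT_PRICE_PER_M = 0.15    # completion tokens

prompt_tokens = 14754
completion_tokens = 1865

cost_usd = (prompt_tokens * INPUT_PRICE_PER_M
            + completion_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"{cost_usd * 100:.4f} cents")  # → 0.0833 cents
```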
20 comments
Ame

@simon Since you have the tokens readily available what do you think about including the pricing calculator directly inside `llm`?
aider is doing a similar thing by directly showing you how much you were billed for each response.

Simon Willison

@ame I want to do a bit more with token accounting - maybe store them in separate database columns - but I'm not so keen on doing the price calculations in the tool because I can't promise correct results: prices change often, sometimes without any warning

I'd be OK outsourcing that to a plugin though

Andrea Borruso

@simon first of all thank you very much.

In your llm logs output you have "total_tokens". I don't have that; instead I often see "totalTokenCount": 3522

gist.github.com/aborruso/6004d

Am I doing something wrong?

Simon Willison

@aborruso that output is different for different models - I have a future plan to normalize those and store them separately in the database
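The normalization Simon describes could look something like the sketch below. The alias table is an assumption for illustration (OpenAI-style snake_case keys vs Gemini-style camelCase), not LLM's actual implementation:

```python
# Different models report token usage under different JSON keys.
# This maps known aliases onto canonical snake_case names.
KEY_ALIASES = {
    "prompt_tokens": ["prompt_tokens", "promptTokenCount"],
    "completion_tokens": ["completion_tokens", "candidatesTokenCount"],
    "total_tokens": ["total_tokens", "totalTokenCount"],
}

def normalize_usage(usage):
    """Return token counts from a raw usage dict under canonical keys."""
    normalized = {}
    for canonical, aliases in KEY_ALIASES.items():
        for key in aliases:
            if key in usage:
                normalized[canonical] = usage[key]
                break
    return normalized

print(normalize_usage({"promptTokenCount": 3000, "totalTokenCount": 3522}))
# → {'prompt_tokens': 3000, 'total_tokens': 3522}
```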

Andrea Borruso

@simon Sorry, another stupid question: in my case, do I have to add up all the totalTokenCount values?

Thank you
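If you do need to add them up yourself, a stdlib-only sketch follows. It assumes (hypothetically) that each entry in the `llm logs --json` output exposes the raw Gemini response under a `response_json` key with a `usageMetadata` dict; adjust the paths to match your actual log shape:

```python
import json

def sum_total_tokens(log_text):
    """Sum totalTokenCount across a JSON array of log entries."""
    total = 0
    for entry in json.loads(log_text):
        usage = entry.get("response_json", {}).get("usageMetadata", {})
        total += usage.get("totalTokenCount", 0)
    return total

# Two fake log entries in the assumed shape:
sample = json.dumps([
    {"response_json": {"usageMetadata": {"totalTokenCount": 3522}}},
    {"response_json": {"usageMetadata": {"totalTokenCount": 1200}}},
])
print(sum_total_tokens(sample))  # → 4722
```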

Simon Willison

If you are still LLM-skeptical but haven't spent much time thinking about or experimenting with these multi-modal variants I'd encourage you to take a look at them

Being able to extract information from images, audio and video is a truly amazing capability, and something which was previously prohibitively difficult - see XKCD 1425 xkcd.com/1425/

Matthew Martin

@simon Half a century later, it is a solved problem.

Daniel

@simon Note that some of us are skeptical for reasons such as the exploitation of creative folks, the copyright infringements at scale, the hype cycle created by venture capital, the impact it has on misinformation and the ads space, and so on. Some of the tech is cool no doubt.

Simon Willison

@djh those are all very valid reasons to be skeptical!

The only reason I'll consistently push back at is the idea that these things aren't useful at all

Andrei Zmievski

@simon Something I've been meaning to ask.. is there a decent guide to which models are best suited for which tasks? As in, "gemini models are better for extracting content from video/audio, etc", including model versions, sizes, etc.

Xing Shi Cai

@simon Does video work? I tried both Gemini pro and flash, but I only got some error message. Do I need a paid account to use video scraping? (Image works as expected.)

Simon Willison

@xsc video should work, what file format were you trying? Currently files need to be less than 20MB - that's a temporary limitation of my llm-gemini plugin

Xing Shi Cai

@simon I was using a 5 MB MP4. The error just says "internal error". I downloaded the video from here pexels.com/video/catching-and-

Simon Willison

@xsc I've seen a few of those "Internal error" messages too - I think it's Gemini being a little bit flaky, sometimes resubmitting works fine the second time

Xing Shi Cai

@simon I was using the following command

> llm 'please explain what is happening in the video' -a man-in-water.mp4 -m gemini-1.5-flash-latest

Does it look like it should work?

Simon Willison

@xsc yes, if you have the llm-gemini plugin installed and configured with an API key

You could try using this script here (or using Google's AI Studio tool) to check it's not an LLM bug: til.simonwillison.net/llms/pro
