Simon Willison

I spun up a new LLM benchmark: how well can they handle this prompt?

Generate an SVG of a pelican riding a bicycle

I find the results so far utterly delightful: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

Here's Claude 3.5 Sonnet (2024-06-20) and Claude 3.5 Sonnet (2024-10-22):

Two images that are recognizable as pelicans on bicycles

Gemini 1.5 Flash 001 and Gemini 1.5 Flash 002:

Two images that are NOT.

25 October at 23:59 | Open on fedi.simonwillison.net

8 comments

Simon Willison

OpenAI's models are quite good at it (not as good as Claude 3.5 Sonnet though)

GPT-4o mini and GPT-4o:

The GPT-4o one is clearly better, but both have distinct pelican on bicycle vibes

o1-mini and o1-preview:

The mini one is pretty bad but the o1-preview one has some promise - it's a good pelican, a bad bicycle

26 October at 0:00 | Open on fedi.simonwillison.net

Simon Willison

The Llama models I tried both did terribly, but Gemini 1.5 Flash 8B wins for weird charm (even if it doesn't really look like a pelican at all)

Cerebras Llama 3.1 70B and Llama 3.1 8B:

These both chose blue backgrounds. They are weird misshapen blobs, though the 70B one at least has elements of a pelican in there somewhere.

And a special mention for Gemini 1.5 Flash 8B:

This is a weird yellow heart-shaped creature with a deformed face which is somehow still quite charming

26 October at 0:02 | Open on fedi.simonwillison.net

Drew Breunig

@simon This is great! For a while I was testing them by asking them to draw a deserted island in Processing. It was hilarious.

26 October at 0:07 | Open on note.computer

Simon Willison

Paul Calcraft extended this idea into an implementation of Pictionary where different vision LLMs generate SVGs and race to guess what the others are drawing, and it is absolutely brilliant: https://twitter.com/paul_cal/status/1850262678712856764

Screenshot of LLM Pictionary game showing Round 1 with a simple drawing of an orange giraffe-like figure with brown spots against blue background. Multiple AI models (Claude 3.5 Sonnet, GPT-4o, Gemini Flash/Pro, Llama) attempt to identify with responses including "duck", "swan", "rocket", "snake", "snail", and "giraffe". GPT-4o guessed correctly first. Website URL x.com/paul_cal shown

Screenshot of LLM Pictionary game showing Round 5 with a simple drawing of an ocean scene with yellow sun, bird, and waves. Multiple AI models (Claude 3.5 Sonnet, GPT-4o, Gemini Flash/Pro, Llama) attempt to identify the image with responses like "Ocean", "Sky", "Sun", and "Beach". Claude 3.5 Sonnet (June 24) guessed correctly first.

26 October at 21:15 | Open on fedi.simonwillison.net

Thierry Carrez

@simon how long until some of those models start to optimize to deal with your pelican obsession? :rofl:

26 October at 6:06 | Open on fosstodon.org

Kevin Marks

@simon saw this and thought of you https://bsky.app/profile/socalleslie.bsky.social/post/3l7ewtd4koe2x

26 October at 2:10 | Open on xoxo.zone

Jeremy Kun

@simon I guess "draw a unicorn in tikz" has already been tainted in the training data?

26 October at 3:43 | Open on mathstodon.xyz

alphaomega

@simon
Just and only my 2 cents: some people do formula 1 racing as a hobby. Or wakeboarding. Or use LLMs.
It's not forbidden. It's fascinating. It's fun.
<-> It's energy consuming

2 November at 8:29 | Open on hessen.social
