I recommend reading the descriptions closely and comparing them with the images. These vision models mix what they are seeing with "knowledge" baked into their weights, and as a result they can often hallucinate things that aren't present in the image.

19 October at 16:57 | Open on fedi.simonwillison.net

7 comments

Leaping Woman

@simon yep, which is particularly not helpful for users of screen readers.

19 October at 17:12 | Open on spore.social

Simon Willison

@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is more detailed descriptions

Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!

19 October at 17:20 | Open on fedi.simonwillison.net

Simon Willison

@leapingwoman I use Claude 3.5 Sonnet to help me write alt text on almost a daily basis, but I never use exactly what it spat out - I always further edit it myself for clarity and to make sure it's as useful as possible

19 October at 17:21 | Open on fedi.simonwillison.net

Leaping Woman

@simon That's the way to do it. Both with image descriptions and with automatic speech-to-text, editing the machine version is key.

19 October at 19:25 | Open on spore.social

Joseph Szymborski :qcca:

@simon @leapingwoman Yah, it looks like Claude 3.5 Sonnet is right on the money with this one:

The image shows a large, neoclassical-style building with white stone walls and columns. The building is identified as the "PIONEER MEMORIAL MUSEUM" by text above its entrance. In front of the building stands a statue, though details of the statue are not clear from this distance.
The foreground of the image shows a sign that reads:
"HEADQUARTERS
INTERNATIONAL SOCIETY
DAUGHTERS OF UTAH PIONEERS"
The building is surrounded by trees, some of which are beginning to bud or leaf out, suggesting it's spring. The sky appears overcast with some clouds visible.
There are sidewalks leading up to the building, and a street is visible to the right side of the image. The overall setting appears to be in an urban or suburban area, likely in Utah given the reference to "Utah Pioneers" on the sign.

19 October at 18:28 | Open on cosocial.ca

Sigismund Ninja

@simon @leapingwoman also, the Llama 3.2 model is quantized so that it uses 4-bit weights (instead of the original 16-bit). And the model is fine-tuned for materials science.

https://huggingface.co/lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k

19 October at 19:40 | Open on mastodon.nu

D.Hamlin.Music

@simon This is #Talkback on the second image.

The image is of a vintage gas station. The image is nostalgic and captures the feeling of a bygone era. The main focus of the image is a red and white gas pump, standing tall and proud. The gas pump is gleaming under soft light filtering through the wooden ceiling. Next to the pump is a white and black gas canister, adding to the authenticity of the setting. A red and white gasoline sign hangs from the ceiling and a yellow and white gasoline sign is suspended above it. The floor beneath these relics of the past is a checkerboard pattern, a common design choice for gas stations of yesteryears. In the background, a variety of other signs and advertisements add to the eclectic mix of objects. They are a testament to the diverse range of products and services that were once available at this location. The image captures a snapshot of history, frozen in time, waiting to be discovered and appreciated by those who take the time to look closer.

19 October at 17:17 | Open on dragonscave.space