I finally managed to get the Llama 3.2 and Phi 3.5 vision models to run on my M2 Mac laptop, using the mistral.rs Rust library and its CLI tool and Python bindings https://simonwillison.net/2024/Oct/19/mistralrs/
I then used the mistralrs-metal Python library to run this photo from Mendenhall's Museum of Gasoline Pumps & Petroliana https://www.niche-museums.com/107 through Microsoft's Phi-3.5 Vision model, with the prompt "What is shown in this image? Write a detailed response analyzing the scene."

@simon wow! You're now making me want the M4 to be announced soon. Very impressive. I'm trying to find many of the objects it's pointing out, and while I can guess what it's referring to, I would struggle to say that it is accurate in describing things in the scene. For example, I see a gas canister, but it isn't white and black, nor is it adjacent to a pump which is red and white (although it is adjacent to two pumps, one red and one white).

@jszym yeah, it's definitely not a completely accurate description - the vision models are even more prone to hallucination than the plain text ones! I recommend reading the descriptions closely and comparing them with the images: these vision models mix what they are seeing with "knowledge" baked into their weights and can often hallucinate things that aren't present in the image as a result.

@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is in more detailed descriptions. Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!

@leapingwoman I use Claude 3.5 Sonnet to help me write alt text on almost a daily basis, but I never use exactly what it spits out - I always edit it further myself for clarity and to make sure it's as useful as possible.

@simon That's the way to do it. Both with image descriptions and with automatic speech-to-text, editing the machine version is key.

@simon @leapingwoman Yeah, it looks like Claude 3.5 Sonnet is right on the money with this one:

@simon @leapingwoman also, the Llama 3.2 model is quantized so that it uses 4-bit weights (instead of the original 16-bit). And the model is fine-tuned for materials science. https://huggingface.co/lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k

@prem_k do you know if anyone has figured out a recipe for running that on the Mac yet?

@simon no, not yet. I'm yet to look into the model files, but if they're available as GGUF or ONNX, it should be possible to run them with llama.cpp or wllama for GGUF, and Transformers.js for ONNX. It's also possible to convert GGUF files for running with Ollama.

@simon thanks for this! I had some issues replicating it, though - on an M3 Max it always crashes. (It's also annoying that it crashes or errors if it cannot find an image. There is a PR to fix that, but it's not merged yet.) Even on the M3 Max, since the in-situ quantization runs on one core, it takes a while... have you experienced any of these?

@fl0_id it didn't crash on me, but I literally only did the things in my write-up - I haven't explored beyond that yet.
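For anyone who wants to try this themselves, the Phi-3.5 Vision call through the Python bindings looked roughly like the sketch below. Treat it as a sketch rather than exact code: the image URL is a placeholder, and the class and argument names follow the mistral.rs examples as I remember them, so check the current docs before relying on them.

```python
# Rough sketch of calling Phi-3.5 Vision via the mistralrs Python bindings
# (installed as the mistralrs-metal package on Apple Silicon).
# Class and argument names follow the project's documented examples -
# treat them as assumptions and verify against the version you install.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="phi3v",  # label only - the loaded model is set by the Runner above
        messages=[
            {
                "role": "user",
                "content": [
                    # Placeholder URL - swap in the photo you want described
                    {"type": "image_url", "image_url": {"url": "https://example.com/gas-pumps.jpeg"}},
                    {"type": "text", "text": "What is shown in this image? Write a detailed response analyzing the scene."},
                ],
            }
        ],
        max_tokens=256,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```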
Here's what I got from Llama 3.2 11B for this photo I took at the Pioneer Memorial Museum in Salt Lake City https://www.niche-museums.com/111
"describe this image including any text"