I finally managed to get the Llama 3.2 and Phi 3.5 vision models to run on my M2 Mac laptop, using the mistral.rs Rust library and its CLI tool and Python bindings https://simonwillison.net/2024/Oct/19/mistralrs/
I then used the mistralrs-metal Python library to run this photo from Mendenhall's Museum of Gasoline Pumps & Petroliana https://www.niche-museums.com/107 through Microsoft's Phi-3.5 Vision model, with the prompt "What is shown in this image? Write a detailed response analyzing the scene."

@simon wow! You're now making me want the M4 to be announced soon. Very impressive. I'm trying to find many of the objects it's pointing out, and while I can guess what it's referring to, I would struggle to say that it is accurate in describing things in the scene. For example, I see a gas canister, but it isn't white and black, nor is it adjacent to a pump which is red and white (although it is adjacent to two pumps, one red and one white).

@jszym yeah, it's definitely not a completely accurate description - the vision models are even more prone to hallucination than the plain text ones! I recommend reading the descriptions closely and comparing them with the images: these vision models mix what they are seeing with "knowledge" baked into their weights and can often hallucinate things that aren't present in the image as a result.

@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is in more detailed descriptions. Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!

@leapingwoman I use Claude 3.5 Sonnet to help me write alt text on almost a daily basis, but I never use exactly what it spits out - I always edit it further myself for clarity and to make sure it's as useful as possible.

@simon That's the way to do it. Both with image descriptions and with automatic speech-to-text, editing the machine version is key.

@simon @leapingwoman Yeah, it looks like Claude 3.5 Sonnet is right on the money with this one:

@simon @leapingwoman also, the Llama 3.2 model is quantized so that it uses 4-bit weights (instead of the original 16-bit). And the model is fine-tuned for materials science. https://huggingface.co/lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k

@prem_k do you know if anyone has figured out a recipe for running that on the Mac yet?

@simon no, not yet. I'm yet to look into the model files, but if they're available as GGUF or ONNX, it should be possible to run them with llama.cpp or wllama for GGUF, and Transformers.js for ONNX. It's also possible to convert GGUF files for running with Ollama.

@simon thanks for this! I had some issues replicating it, though - on an M3 Max it always crashes. (It's also annoying that it crashes or errors if it cannot find an image. There is a PR to fix that, but it's not merged yet.) Even on the M3 Max, since the in-situ quantization runs on one core, it takes a while... have you experienced any of these?

@fl0_id it didn't crash on me, but I literally only did the things in my write-up - I haven't explored beyond that yet.
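For anyone who wants to try this themselves, the Phi-3.5 Vision call through the Python bindings looked roughly like the sketch below. Treat it as a sketch rather than exact code: the image URL is a placeholder, and the class and argument names follow the mistral.rs examples as I remember them, so check the current docs before relying on them.

```python
# Rough sketch of calling Phi-3.5 Vision via the mistralrs Python bindings
# (installed as the mistralrs-metal package on Apple Silicon).
# Class and argument names follow the project's documented examples -
# treat them as assumptions and verify against the version you install.
from mistralrs import Runner, Which, ChatCompletionRequest, VisionArchitecture

runner = Runner(
    which=Which.VisionPlain(
        model_id="microsoft/Phi-3.5-vision-instruct",
        arch=VisionArchitecture.Phi3V,
    ),
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="phi3v",  # label only - the loaded model is set by the Runner above
        messages=[
            {
                "role": "user",
                "content": [
                    # Placeholder URL - swap in the photo you want described
                    {"type": "image_url", "image_url": {"url": "https://example.com/gas-pumps.jpeg"}},
                    {"type": "text", "text": "What is shown in this image? Write a detailed response analyzing the scene."},
                ],
            }
        ],
        max_tokens=256,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
```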
Here's what I got from Llama 3.2 11B for this photo I took at the Pioneer Memorial Museum in Salt Lake City https://www.niche-museums.com/111
"describe this image including any text"