Here's what I got from Llama 3.2 11B for this photo I took at the Pioneer Memorial Museum in Salt Lake City https://www.niche-museums.com/111
"describe this image including any text"
@simon wow! You're now making me want the M4 to be announced soon. Very impressive. I'm trying to find many of the objects it's pointing out, and while I can guess what it's referring to, I would struggle to say it is accurate in describing things in the scene. For example, I see a gas canister, but it isn't white and black, nor is it adjacent to a pump that is red and white (although it is adjacent to two pumps, one red and one white).

@jszym yeah, it's definitely not a completely accurate description - the vision models are even more prone to hallucination than plain text models! I recommend reading the descriptions closely and comparing them with the images: these vision models mix what they are seeing with "knowledge" baked into their weights, and as a result can often hallucinate things that aren't present in the image.

@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is in more detailed descriptions. Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!

@leapingwoman I use Claude 3.5 Sonnet to help me write alt text almost daily, but I never use exactly what it spat out - I always edit it further myself for clarity and to make sure it's as useful as possible.

@simon That's the way to do it. Both with image descriptions and with automatic speech-to-text, editing the machine version is key.

@simon @leapingwoman Yah, it looks like Claude 3.5 Sonnet is right on the money with this one:

@simon @leapingwoman also, the Llama 3.2 model is quantized to 4-bit weights (instead of the original 16-bit), and it is fine-tuned for materials science. https://huggingface.co/lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
I then used the mistralrs-metal Python library to run this photo from Mendenhall's Museum of Gasoline Pumps & Petroliana through Microsoft's Phi-3.5 Vision model: https://www.niche-museums.com/107
"What is shown in this image? Write a detailed response analyzing the scene."