I recommend reading the descriptions closely and comparing them with the images - these vision models mix what they are seeing with "knowledge" baked into their weights and can often hallucinate things that aren't present in the image as a result
@leapingwoman I've talked to screen reader users who still get enormous value out of the vision LLMs - they're generally reliable for things like text and high-level overviews; where they get weird is in more detailed descriptions. Plus the best hosted models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) are a whole lot less likely to hallucinate than the ones I can run on my laptop!

@leapingwoman I use Claude 3.5 Sonnet to help me write alt text on almost a daily basis, but I never use exactly what it spat out - I always further edit it myself for clarity and to make sure it's as useful as possible.

@simon That's the way to do it. Both with image descriptions and with automatic speech-to-text, editing the machine version is key.

@simon @leapingwoman Yah, it looks like Claude 3.5 Sonnet is right on the money with this one.

@simon @leapingwoman Also, the llama 3.2 model is quantized so that it uses 4-bit weights (instead of the original 16-bit), and the model is fine-tuned for materials science. https://huggingface.co/lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k
@simon Yep, which is particularly unhelpful for users of screen readers.