Turns out we weren't done with major LLM releases in 2024 after all... Alibaba's Qwen just released QvQ, a "visual reasoning model" - the same chain-of-thought trick as OpenAI's o1, but applied strictly to running a prompt against an image.
I've been trying it out and it's a lot of fun to poke around with: https://simonwillison.net/2024/Dec/24/qvq/
Here's what it said when I asked it to count those pelicans: