Did you know Google’s Gemini 1.5 Pro vision LLM is trained to return bounding boxes for objects found within images?
I built a browser tool that lets you run a prompt with an image against Gemini and visualize the bounding boxes it returns.
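Here's a rough sketch of the coordinate math a visualizer like this needs. It assumes the model's text response contains boxes written as `[ymin, xmin, ymax, xmax]` on a 0-1000 scale, so check that against the output you actually get. The names `PixelBox` and `extractBoxes` are made up for this example and are not taken from the tool's source.

```typescript
// Pixel-space rectangle, in the shape a canvas strokeRect call expects.
interface PixelBox {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Assumption: every box in the response looks like [ymin, xmin, ymax, xmax],
// with each value an integer between 0 and 1000.
function extractBoxes(
  responseText: string,
  imageWidth: number,
  imageHeight: number
): PixelBox[] {
  // A fresh regex per call, so the global flag's lastIndex state never leaks.
  const pattern = /\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]/g;
  const boxes: PixelBox[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(responseText)) !== null) {
    const ymin = Number(match[1]);
    const xmin = Number(match[2]);
    const ymax = Number(match[3]);
    const xmax = Number(match[4]);
    // Scale from the 0-1000 grid to the real image dimensions.
    boxes.push({
      x: (xmin * imageWidth) / 1000,
      y: (ymin * imageHeight) / 1000,
      width: ((xmax - xmin) * imageWidth) / 1000,
      height: ((ymax - ymin) * imageHeight) / 1000,
    });
  }
  return boxes;
}
```

Each resulting box can then be drawn over the image with something like `ctx.strokeRect(box.x, box.y, box.width, box.height)` on a canvas 2D context.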
You can try it out using your own Google Gemini API key: https://tools.simonwillison.net/gemini-bbox
I built it with Claude 3.5 Sonnet, initially as an Artifact prototype, then upgraded it to a standalone browser app that talks directly to the CORS-enabled Gemini API and stores the user’s API key in localStorage.
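The "ask once, then remember the key" pattern might look something like the sketch below. The storage key name `gemini-api-key`, the `KeyStore` interface and the `getApiKey` helper are all hypothetical, invented for illustration rather than copied from the app.

```typescript
// The small slice of the Web Storage API this helper needs.
// window.localStorage satisfies it, and so does a test stub.
interface KeyStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical storage key name, chosen for this example only.
const STORAGE_KEY = "gemini-api-key";

// Returns the saved API key, or asks for one and saves it for next time.
// `ask` is injected so the caller decides how to prompt the user.
function getApiKey(store: KeyStore, ask: () => string | null): string | null {
  const saved = store.getItem(STORAGE_KEY);
  if (saved) {
    return saved;
  }
  const entered = (ask() ?? "").trim();
  if (entered === "") {
    // The user cancelled or typed nothing, so store nothing.
    return null;
  }
  store.setItem(STORAGE_KEY, entered);
  return entered;
}
```

In a browser this could be called as `getApiKey(window.localStorage, () => window.prompt("Enter your Gemini API key"))`. The key then stays in that browser's storage and is only sent when the page calls the API.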
Full details and prompt transcripts showing how I built it are here on my blog: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/