Simon Willison

Did you know Google’s Gemini 1.5 Pro vision LLM is trained to return bounding boxes for objects found within images?

I built this browser tool that lets you run a prompt with an image against Gemini and visualize the bounding boxes

You can try it out using your own Google Gemini API key: tools.simonwillison.net/gemini
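For anyone curious what the underlying request looks like: here is a minimal sketch of calling the Gemini REST API with an image and asking for boxes, assuming the v1beta generateContent endpoint. The prompt wording and response parsing are illustrative, not the tool's exact code.

```typescript
// Minimal sketch: ask Gemini 1.5 Pro for bounding boxes via the REST API.
// Gemini returns [ymin, xmin, ymax, xmax] coordinates normalized to 0-1000.
async function getBoundingBoxes(apiKey: string, imageBase64: string) {
  const url =
    "https://generativelanguage.googleapis.com/v1beta/models/" +
    `gemini-1.5-pro:generateContent?key=${apiKey}`;
  const body = {
    contents: [{
      parts: [
        { text: "Return bounding boxes for each object as JSON: " +
                '[{"label": ..., "box": [ymin, xmin, ymax, xmax]}]' },
        { inline_data: { mime_type: "image/jpeg", data: imageBase64 } },
      ],
    }],
  };
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });
  const data = await response.json();
  // The model's answer comes back as text in the first candidate part.
  return data.candidates[0].content.parts[0].text;
}
```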

Simon Willison

I built it with Claude 3.5 Sonnet, initially as an Artifact prototype, then upgraded it to a standalone browser app that talks directly to the Gemini API (which supports CORS), storing the user's API key in localStorage

Full details and prompt transcripts showing how I built it are here on my blog: simonwillison.net/2024/Aug/26/
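The localStorage pattern described above is simple enough to sketch; the storage key name below is an assumption, not necessarily what the tool uses:

```typescript
// Sketch of the API-key persistence pattern described above.
// The storage key name is an assumption, not necessarily the tool's.
const STORAGE_KEY = "gemini-api-key";

function getApiKey(): string {
  let key = localStorage.getItem(STORAGE_KEY);
  if (!key) {
    // Prompt once, then remember the key for future visits.
    key = window.prompt("Enter your Google Gemini API key:") ?? "";
    localStorage.setItem(STORAGE_KEY, key);
  }
  return key;
}
```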

Simon Willison

How I built this is a pretty good illustration of how convoluted my workflow for getting useful results out of LLMs has become: I used Claude 3.5 Sonnet to build a web app for talking to Gemini 1.5 Pro, then fired up GPT-4o with Code Interpreter to help debug a weird JPEG issue

Jon Gilbert

@simon ...in this example, the left-goat bounding box looks quite off?

Jon Gilbert

@simon the left goat's ymin,xmin is reported as 137,134, but visually the red boundary looks more like 340,140-ish? Further, its ymax,xmax is 653,375, but visually appears to be 870,375?

Simon Willison

@jgilbert oh! Yeah, I think I'm drawing the axis ticks upside down for the vertical axis: it counts from 0 at the top to 1000 at the bottom
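To make that convention concrete: the boxes are normalized to 0-1000 with the origin at the top-left, which matches the HTML canvas convention, so converting to pixels needs no vertical flip. A sketch (the function and interface names here are illustrative):

```typescript
// Gemini boxes are [ymin, xmin, ymax, xmax], normalized to 0-1000,
// with y = 0 at the TOP of the image - the same origin as an HTML canvas.
interface PixelBox { x: number; y: number; width: number; height: number; }

function toPixels(
  box: [number, number, number, number], // [ymin, xmin, ymax, xmax]
  imageWidth: number,
  imageHeight: number,
): PixelBox {
  const [ymin, xmin, ymax, xmax] = box;
  return {
    x: (xmin / 1000) * imageWidth,
    y: (ymin / 1000) * imageHeight, // no flip: 0 is already the top
    width: ((xmax - xmin) / 1000) * imageWidth,
    height: ((ymax - ymin) / 1000) * imageHeight,
  };
}
```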

Jon Gilbert

@simon ah HA, I knew there was a reason that was better than "I haven't had coffee yet today but it sure looks wack".

Simon Willison

@jgilbert I shipped a fix for the tool; the axes display correctly now - thanks!

Adrien Delessert

@simon Thanks for this! I've just started working on a project that needs to both generate bounding boxes and extract some qualitative information from images—hopefully Gemini can be a one-stop shop for that, rather than stringing things together like I'd started to do.

Microsoft has docs on a GPT-4 + "Enhancements" vision model with grounding/bounding boxes, but when you get into their dashboard it seems like it's actually deprecated. 🙄
