Simon Willison

Did you know Google’s Gemini 1.5 Pro vision LLM is trained to return bounding boxes for objects found within images?

I built this browser tool that lets you run a prompt with an image against Gemini and visualize the bounding boxes it returns.

You can try it out using your own Google Gemini API key: https://tools.simonwillison.net/gemini-bbox

26 August at 16:54 | Open on fedi.simonwillison.net

12 comments

Simon Willison

I built it with Claude 3.5 Sonnet, initially as an Artifact prototype, then upgraded it to a standalone browser app that talks directly to the (CORS-enabled) Gemini API, storing the user’s API key in localStorage

Full details and prompt transcripts showing how I built it are here on my blog: https://simonwillison.net/2024/Aug/26/gemini-bounding-box-visualization/

26 August at 16:56 | Open on fedi.simonwillison.net

Simon Willison

How I built this is a pretty good illustration of how convoluted my workflow for getting useful results out of LLMs has become - I used Claude 3.5 Sonnet to build a web app for talking to Gemini 1.5 Pro, and fired up GPT-4o with Code Interpreter to help debug a weird JPEG issue

26 August at 17:25 | Open on fedi.simonwillison.net

Tim Kellogg

@simon and they say AI won’t create jobs!

26 August at 18:47 | Open on hachyderm.io

Simon Willison

Sent this out with a bunch of other quotes and links in my latest newsletter https://simonw.substack.com/p/building-a-tool-showing-how-gemini

26 August at 22:46 | Open on fedi.simonwillison.net

Jon Gilbert

@simon ...in this example, the left-goat bounding box looks quite off?

26 August at 17:01 | Open on mastodon.social

Simon Willison

@jgilbert looks good to me, what’s off about it?

26 August at 17:03 | Open on fedi.simonwillison.net

Jon Gilbert

@simon the left goat's ymin,xmin is reported to be 137,134 respectively, but visually the red boundary looks to be more like 340,140 ish? Further, its ymax,xmax is 653,375, but visually appears to be 870,375?

26 August at 17:13 | Open on mastodon.social

Simon Willison

@jgilbert oh! Yeah, I think I’m drawing the ticks for the vertical axis upside down: it counts from 0 at the top to 1000 at the bottom

26 August at 17:16 | Open on fedi.simonwillison.net

Jon Gilbert

@simon ah HA, I knew there was a reason that was better than "I haven't had coffee yet today but it sure looks wack".

26 August at 17:21 | Open on mastodon.social

Simon Willison

@jgilbert I shipped a fix for the tool, the axes display correctly now - thanks!

26 August at 17:46 | Open on fedi.simonwillison.net
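
For anyone following the exchange above, here is a minimal Python sketch of the coordinate convention being discussed. It assumes, as the thread describes, boxes in ymin, xmin, ymax, xmax order on a 0 to 1000 grid with 0 at the top. The function names and the 2000x1000 image size are hypothetical, chosen only for illustration, and this is not the tool's actual code.

```python
def to_pixels(box, width, height, scale=1000):
    """Convert a [ymin, xmin, ymax, xmax] box on a 0..scale grid
    (origin at the top-left, y growing downward) into pixel
    coordinates (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box
    left = xmin * width / scale
    top = ymin * height / scale
    right = xmax * width / scale
    bottom = ymax * height / scale
    return (left, top, right, bottom)


def read_on_flipped_axis(y, scale=1000):
    """Value you would read for y if the vertical ticks were
    mistakenly labelled from 0 at the bottom to scale at the top."""
    return scale - y


# The left goat's box as quoted in the thread; the image size is made up.
left_goat = [137, 134, 653, 375]
print(to_pixels(left_goat, width=2000, height=1000))

# Reading the same box against an upside-down vertical axis.
print(read_on_flipped_axis(137), read_on_flipped_axis(653))
```

Under that flipped-axis assumption, ymin 137 and ymax 653 read as 863 and 347, which is close to the "870" and "340 ish" that looked wrong in the screenshot.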

Grant Custer

@simon nice! i've got one here too :) https://gemini-spatial-example.grantcuster.com/

that TIFF bug/trick is interesting!

26 August at 17:18 | Open on mastodon.social

Adrien Delessert

@simon Thanks for this! I've just started working on a project that needs to both generate bounding boxes and extract some qualitative information from images—hopefully Gemini can be a one stop shop for that, rather than stringing things together like I'd started to do.

Microsoft has docs on a GPT4+"Enhancements" vision model with grounding/bounding boxes, but when you get into their dashboard it seems like it's actually deprecated. 🙄

26 August at 18:22 | Open on infosec.exchange