So far I've run Qwen2.5-Coder-32B successfully in two different ways: once via Ollama (and the llm-ollama plugin) and once using Apple's MLX framework and mlx-lm - details on how I ran both of those are in my article.
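For the Ollama route, the steps look roughly like this - a sketch assuming you already have Ollama running and the llm CLI installed (the qwen2.5-coder:32b tag is the ~20GB quantized build, and the final prompt is just an illustrative example):

ollama pull qwen2.5-coder:32b     # fetch the quantized model
llm install llm-ollama            # let the llm CLI talk to local Ollama models
llm -m qwen2.5-coder:32b 'write a Python function that reverses a string'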
@simon Your post mentioned a ~20GB quantized file via Ollama; did that take up 20GB of RAM, or 32GB? I'm waiting on delivery of a 48GB M4 Pro this week or early next, which is why I'm curious.

@edmistond I just tried running a prompt through the Ollama qwen2.5-coder:32b model and to my surprise it appeared to peak at just 2GB of RAM usage, but it was using 95% of my GPU. I thought GPU and system RAM were shared on macOS, so I don't entirely understand what happened there; I would have expected more like 20GB of RAM use.

@simon Interesting, thanks for checking! Either way, since I currently work on a 16GB M1 with no problems for my day-to-day tools, I know I should have enough RAM to run my normal tools plus that model for experimentation. 🙂

Added an example showing Qwen 2.5 Coder's performance on my "pelican on a bicycle" benchmark:

llm -m qwen2.5-coder:32b 'Generate an SVG of a pelican riding a bicycle'

It's not the *worst* I've seen! (One way to preview the output is sketched after this thread.) https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

@simon There is something to be said for generating bad SVG graphics for things. With a different color palette, I have seen worse art on paper cups and hanging in offices. It could easily work for project release artwork.
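To try that pelican benchmark yourself, one way to preview the result is to save the response and open the SVG - a rough sketch, assuming macOS and the Ollama setup above, and noting that the model's reply usually wraps the SVG in explanatory text and a Markdown code fence you'll need to copy out by hand:

llm -m qwen2.5-coder:32b 'Generate an SVG of a pelican riding a bicycle' > pelican.md
# copy the <svg>...</svg> block from pelican.md into pelican.svg, then:
open pelican.svg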
Here's a one-liner that should work for you if you have uv installed on a Mac with 64GB of RAM (it will download ~32GB of model files the first time you run it):
uv run --with mlx-lm \
  mlx_lm.generate \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-8bit \
  --max-tokens 4000 \
  --prompt 'write me a python function that renders a mandelbrot fractal as wide as the current terminal'
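The same run can also be driven from Python using mlx-lm's load() and generate() helpers - a minimal sketch based on mlx-lm's documented API, assuming mlx-lm is installed in your environment (the exact generate() signature may vary slightly between versions):

# pip install mlx-lm   (or: uv pip install mlx-lm)
from mlx_lm import load, generate

# downloads the ~32GB 8-bit quantized model from Hugging Face on first use
model, tokenizer = load("mlx-community/Qwen2.5-Coder-32B-Instruct-8bit")

messages = [{
    "role": "user",
    "content": "write me a python function that renders a "
               "mandelbrot fractal as wide as the current terminal",
}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=4000, verbose=True)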