Vision Now Available in Llama.cpp
22 comments · May 10, 2025
danielhanchen
raffraffraff
I can't see the letters "ngl" anymore without wanting to punch something.
danielhanchen
Oh, it's shorthand for the number of layers to offload to the GPU for faster inference :) but yes, it's probs not the best abbreviation.
blowsand
frfr
thenameless7741
If you install llama.cpp via Homebrew, llama-mtmd-cli is already included. So you can simply run `llama-mtmd-cli <args>`
danielhanchen
Oh even better!!
danielhanchen
If it helps, I updated https://docs.unsloth.ai/basics/gemma-3-how-to-run-and-fine-t... to show you can use llama-mtmd-cli directly - it should work for Mistral Small as well
banana_giraffe
I used this to create keywords and descriptions for a bunch of photos from a recent trip, using Gemma 3 4B. It works impressively well, including doing basic OCR to give me summaries of photos of text, and picking up context clues to figure out where many of the pictures were taken.
Very nice for something that's self-hosted.
accrual
That's pretty neat. Do you essentially loop over a list of images and run the prompt for each, then store the result somewhere (metadata, sqlite)?
banana_giraffe
Yep, exactly: I just looped through each image with the same prompt and stored the results in a SQLite database, to search through and maybe present in something more than a simple WebUI in the future.
If you want to see, here it is:
https://gist.github.com/Q726kbXuN/f300149131c008798411aa3246...
Here's an example of the kind of detail it built up for me for one image:
It's wrapped up in a bunch of POC code around talking to LLMs, so it's very, very messy, but it does work. It'll probably even work for someone who's not me.
simonw
This is the most useful documentation I've found so far to help understand how this works: https://github.com/ggml-org/llama.cpp/tree/master/tools/mtmd...
nico
How does this compare to using a multimodal model like gemma3 via ollama?
Any benefit on a Mac with Apple Silicon? Any experiences someone could share?
ngxson
Two things:
1. Because the support in llama.cpp is horizontally integrated within the ggml ecosystem, we can optimize it to run even faster than ollama.
For example, the pixtral/mistral small 3.1 models have a 2D-RoPE trick that uses less memory than ollama's implementation. Same for flash attention (which will be added very soon): it will allow the vision encoder to run faster while using less memory.
2. llama.cpp simply supports more models than ollama. For example, ollama supports neither pixtral nor smolvlm.
gryfft
Seems like another step change. The first time I ran a local LLM on my phone and carried on a fairly coherent conversation, I imagined edge inference would take off really quickly at least with e.g. personal assistant/"digital waifu" business cases. I wonder what the next wave of apps built on Llama.cpp and its downstream technologies will do to the global economy in the next three months.
LPisGood
The “global economy in the next three months” bit is writing some checks that I don't know all of the recent AI craze has been able to cash in three years.
ijustlovemath
AI is fundamentally learning the entire conditional probability distribution of our collective knowledge; but sampling it over and over is not going to fundamentally enhance it, except perhaps to reinforce a mean, or to surface places we have insufficiently sampled. For me, even the deep research agents aren't the best at surfacing truth, because that kind of nuance is lost on the distribution.
I think that if we're realistic with ourselves, AI will become exponentially more expensive to train, but without additional high-quality data (not you, synthetic data), we're back to 1980s-era AI (expert systems), just with enhanced fossil fuel usage to keep up with the TPUs. What's old is new again, I suppose!
I sincerely hope to be proven wrong, of course, but I think recent AI innovation has stagnated in terms of new things it can do. It's a great tool when you use it to leverage that distribution (e.g., semantic search), but it might not fundamentally be the approach to AGI (unless your goal is to replicate what we can already do, just less spiky).
gryfft
It doesn't have to be AGI to have a major economic impact. It just has to beat enough extant CAPTCHA implementations.
MoonGhost
It's not as simple as a stochastic parrot. Starting from definitions and axioms, all theorems can be invented and proved, in theory without having the theorems in the training set. That's what thinking models should be able to do without additional training and data.
In other words, the way forward seems to be to put models in loops, which includes internal 'thinking' and external feedback. Make them use newly generated and acquired data, lossy-compress that data periodically, and we have another race of algorithms.
behnamoh
didn't llama.cpp use to have vision support last year or so?
danielhanchen
Yes, they always did, but they moved it all under one umbrella called "llama-mtmd-cli"!
It works super well!
You'll have to compile llama.cpp from source, and you should get a llama-mtmd-cli program.
I made some quants with vision support - literally run:
```
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL -ngl 99
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-12b-it-GGUF:Q4_K_XL -ngl 99
./llama.cpp/llama-mtmd-cli -hf unsloth/gemma-3-27b-it-GGUF:Q4_K_XL -ngl 99
./llama.cpp/llama-mtmd-cli -hf unsloth/Mistral-Small-3.1-24B-Instruct-2503-GGUF:Q4_K_XL -ngl 99
```
Then load the image with `/image image.png` inside the chat, and chat away!