Llama.cpp supports Vulkan. Why doesn't Ollama?

lolinder

So many here are trashing on Ollama, saying it's "just" nice porcelain around llama.cpp and it's not doing anything complicated. Okay. Let's stipulate that.

So where's the non-sketchy, non-for-profit equivalent? Where's the nice frontend for llama.cpp that makes it trivial for anyone who wants to play around with local LLMs without having to know much about their internals? If Ollama isn't doing anything difficult, why isn't llama.cpp as easy to use?

Making local LLMs accessible to the masses is an essential job right now—it's important to normalize owning your data as much as it can be normalized. For all of its faults, Ollama does that, and it does it far better than any alternative. Maybe wait to trash it for being "just" a wrapper until someone actually creates a viable alternative.

chown

I totally agree with this. I wanted to make it really easy for non-technical users with an app that hid all the complexities. I basically just wanted to embed the engine without making users open their terminal, let alone configure anything. I started with llama.cpp and almost gave up on the idea before I stumbled upon Ollama, which made the app happen [1].

There are many flaws in Ollama, but it makes many things much easier, especially if you don't want to bother with building and configuring. They do take a long time to merge any PRs, though. One of my PRs has been waiting for 8 months, and there was another PR about KV cache quantization that took them 6 months to merge.

[1]: https://msty.app

traverseda

>So where's the non-sketchy, non-for-profit equivalent?

Serving models is currently expensive. I'd argue that some big cloud providers have conspired to make egress bandwidth expensive.

That, coupled with the increasing scale of the internet, makes it harder and harder for smaller groups to do these kinds of things. At least until we get some good content-addressed distributed storage system.

buyucu

Supporting Vulkan will help Ollama reach the masses who don't have dedicated GPUs from Nvidia.

This is such low-hanging fruit that it's silly how they are acting.

lolinder

As has been pointed out in this thread in a comment that you replied to (so I know you saw it) [0], Ollama goes to a lot of contortions to support multiple llama.cpp backends. Yes, their solution is a bit of a hack, but it means that the effort involved in adding a new backend is substantial.

And again, they're doing those contortions to make it easy for people. Making it easy involves trade-offs.

Yes, Ollama has flaws. They could communicate better about why they're ignoring PRs. All I'm saying is let's not pretend they're not doing anything complicated or difficult when no one has been able to recreate what they're doing.

[0] https://news.ycombinator.com/item?id=42886933

buyucu

This is incorrect. The effort it took to enable Vulkan was relatively minor. The PR is short and to be honest it doesn't do much, because it doesn't need to.

bestcoder69

lolinder

Llamafile is great but solves a slightly different problem very well: how do I easily download and run a single model without having any infrastructure in place first?

Ollama solves the problem of how I run many models without having to deal with many instances of infrastructure.

Havoc

Ollama was good initially in that it made LLMs more accessible to non-technical people while everyone was figuring things out.

Lately they seem to be contributing mostly confusion to the conversation.

The #1 model the entire world is talking about is literally mislabeled on their side. There is no such thing as R1-1.5b. Quantization without telling users also confuses noobs as to what is possible. Setting up an API different from the thing they're wrapping adds chaos. And claiming each feature added to llama.cpp as something "Ollama now supports" is exceedingly questionable, especially when combined with the very sparse acknowledgement that it's a wrapper at all.

Whole thing just doesn't have good vibes

dingocat

What do you mean there is no such thing as R1-1.5b? DeepSeek released a distilled version based on a 1.5B Qwen model with the full name DeepSeek-R1-Distill-Qwen-1.5B, see chapter 3.2 on page 14 of their research article [0].

[0] https://arxiv.org/abs/2501.12948

trissi1996

Which is not the same model; it's not R1, it's R1-Distill-Qwen-1.5B...

cedws

The way Ollama has basically been laundering llama.cpp's features as its own felt dodgy; this appears to confirm there's something underhanded going on.

buyucu

I did not assume the worst when submitting the post, but that is also my suspicion. The whole thing is very dodgy.

bloomingkales

Are closer-to-the-metal AI developers an under-tracked bottleneck? AMD and Intel can barely get off the ground due to lagging software developers.

jvanderbot

This is where I want to work. But I feel like an AI SWE is more likely to go "down" than an AI company is to hire me, a guy who loves optimizing pipelines for parallelism.

zozbot234

Metal is an Apple thing, not Intel or AMD. (And Ollama supports that.)

moffkalast

Ollama is a private for profit company, of course there's something shady going on.

ethbr1

Ollama is a private for profit AI company, of course there's something shady going on.

Because apparently you can take unethical business practices, add AI, and suddenly it's a whole new thing that no one can judge!

moffkalast

Well yes, though I was thinking more that they have no clear way to get income besides VCs and will need to figure out how to monetize in some weird way eventually. I would not have predicted them taking Nvidia money to axe AMD compatibility though, lol.

andy_ppp

It would be extremely unsurprising if Nvidia was funding this embrace and extend behind the scenes.

parineum

It would be pretty surprising to their shareholders if Nvidia was hiding where it was spending its money.

buyucu

llama.cpp has supported Vulkan for more than a year now. For more than 6 months there has been an open PR to add Vulkan backend support to Ollama. However, the Ollama team has not even looked at it or commented on it.

Vulkan backends are existential for running LLMs on consumer hardware (iGPUs especially). It's sad to see Ollama miss this opportunity.

Kubuxu

Don’t be sad for commercial entity that is not a good player https://github.com/ggerganov/llama.cpp/pull/11016#issuecomme...

andy_ppp

This is great. I did not know about RamaLama; I'll be using and recommending it from now on, and if I see people using Ollama in instructions I'll recommend they move to RamaLama. Cheers.

jdright

Yeah, I would love an actual alternative to Ollama, but RamaLama is not it, unfortunately. As the other commenter said, onboarding is important. I just want a one-step install that works, and the simple fact that RamaLama is written in Python assures it will never be that easy; this is even more true with LLM stuff when using an AMD GPU.

I know there will be people who disagree with this, and that's ok. This is my personal experience with Python in general, and it's 10x worse when I need to figure out all the compatible packages with specific ROCm support for my GPU. This is madness; even C and C++ setup and build is easier than this Python hell.

api

This is fascinating. I’ve been using ollama with no knowledge of this because it just works without a ton of knobs I don’t feel like spending the time to mess with.

As usual, the real work seems to be appropriated by people who do the last little bit — put an acceptable user experience and some polish on it — and they take all the money and credit.

It’s shitty but it also happens because the vast majority of devs, especially in the FOSS world, do not understand or appreciate user experience. It is bar none the most important thing in the success of most things in computing.

My rule is: every step a user has to do to install or set up something halves adoption. So if 100 people enter and there are two steps, 25 complete the process.
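
As a toy illustration of that halving rule (just a sketch; the 50%-per-step figure is the rule of thumb above, not measured data):

    def expected_completions(users, steps, per_step_retention=0.5):
        # each extra install/setup step keeps only a fraction of the remaining users
        return users * per_step_retention ** steps

    print(expected_completions(100, 2))  # 25.0 -- two steps, 25 of 100 finish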

For a long time Apple was the most valuable corporation on Earth on the basis of user experience alone. Apple doesn’t invent much. They polish it, and that’s where like 99% of the value is as far as the market is concerned.

The reason is that computers are very confusing and hard to use. Computer people, which most of us are, don’t see that because it’s second nature to us. But even for computer people you get to the point where you’re busy and don’t have time to nerd out on every single thing you use, so it even matters to computer people in the end.

bearjaws

It's hilarious that the Docker guys are trying to take another OSS project and monetize it. Hey, if it worked once?...

buyucu

I was not aware of this context, thanks!

n144q

Thanks, just yesterday I discovered that Ollama could not use the iGPU on my AMD machine, and was going through a long issue thread for solutions/workarounds (https://github.com/ollama/ollama/issues/2637). Existing instructions are based on Linux, and some people found it utterly surprising that anyone wants to run LLMs on Windows (really?). While I would have no trouble installing Linux and compiling from source, I wasn't ready to do that to my main, daily-use computer.

Great to see this.

PS. Have you got feedback on whether this works on Windows? If not, I can try to create a build today.

zozbot234

The PR has been legitimately out-of-date and unmergeable for many months. It was forward-ported a few weeks ago, and is now still awaiting formal review and merging. (To be sure, Vulkan support in Ollama will likely stay experimental for some time even if the existing PR is merged, and many setups will need manual adjustment of the number of GPU layers and such. It's far from 100% foolproof even in the best-case scenario!)
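
As a rough example of what that manual adjustment can look like (a sketch against Ollama's local HTTP API, using its documented num_gpu option for the number of offloaded layers; the model name and the value 20 are placeholders, and the right number depends entirely on your VRAM):

    import requests

    # Ask a locally running Ollama server to offload only 20 layers to the GPU;
    # the remaining layers stay on the CPU.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "llama3",        # placeholder model name
            "prompt": "Hello",
            "stream": False,
            "options": {"num_gpu": 20},
        },
    )
    print(resp.json()["response"])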

For that matter, some people are still having issues building and running it, as seen from the latest comments on the linked GitHub page. It's not clear that it's even in a fully reviewable state just yet.

buyucu

This PR was reviewable multiple times and rebased multiple times, all because the Ollama team kept ignoring it. It has been open for almost 7 months now without a single comment from the Ollama folks.

9cb14c1ec0

The PR at issue here blocks iGPUs. My fork of the PR removes that:

https://github.com/9cb14c1ec0/ollama-vulkan

I successfully ran Phi4 on my AMD Ryzen 7 PRO 5850U iGPU with it.

buyucu

This is great! I think pufferfish is taking PRs to his fork as well.

a12k

Ollama is sketchy enough that I run it in a VM. Which is odd because it would probably take less effort to just run Llama.cpp directly, but VMs are pretty easy so just went that route.

When I see people bring up the sketchiness most of the time the creator responds with the equivalent of shrugs, which imo increases the sketchiness.

n144q

Care to elaborate what "sketchy" refers to here?

nicce

> but VMs are pretty easy so just went that route.

Don't you need at least 2 GPUs in that case, with kernel-level passthrough?

a12k

I don’t use GPU. Works fine, but the large Mixtral models are slow.

bdhcuidbebe

I pass my dGPU through to the VM and use the iGPU for the desktop.

nialv7

It's fully open source. I mean yes it uses llama.cpp without giving it credit. But why run it in a VM?

krowek

> But why run it in a VM?

Because you don't execute untrusted code on your machine without containerization/virtualization. Do you?

Aurornis

The question was asking why it’s untrusted code, not why you run untrusted code in a VM.

There are a lot of open-source tools that we have to trust to get anything done on a daily basis.

adastra22

Every single day. There's just too much good software out there, and life is too short to be so paranoid.

instagary

Isn't there a clause in MIT that says you're required to give credit? Also, I didn't know it was a YC company that started it: https://www.ycombinator.com/companies/ollama.

aduffy

The project existed as open source first, and the creators subsequently sought funding to work on it full time.

a12k

It severely over-permissions itself on my Mac.

celeritascelery

I have never had it request any permissions

super_mario

Can you please elaborate? How are you running Ollama? I just build it from source and have written a shell script to start/stop it. It runs under my local user account (I should probably give it its own user) and is of course not exposed outside localhost.

buyucu

Ollama advertising llama.cpp features as their own is very dishonest, in my opinion.

portaouflop

That's the curse and blessing of open source, I guess? I have billion-dollar companies running my OSS software without giving me anything - but do I gripe about it in public forums? Yeah, maybe sometimes, but it never helps to improve the situation.

weinzierl

It's the curse of permissively licensed open source. Copyleft is not the answer to everything but against companies leeching and not giving back it is effective.

sitkack

Are they a wrapper with a similar name? You, like I, do gripe in public forums.

adastra22

Welcome to open source.

the_mitsuhiko

Ollama needs competition. I’m not sure what drives the people that maintain it but some of their actions imply that there are ulterior motives at play that do not have the benefit of their users in mind.

However such projects require a lot of time and effort and it’s not clear if this project can be forked and kept alive.

Deathmax

The most recent one off the top of my head is their horrendous aliasing of DeepSeek R1 on their model hub, misleading users into thinking they are running the full model, when really anything but the 671b alias is one of the distilled models. This has already led to lots of people claiming that they are running R1 locally when they are not.

TeMPOraL

The whole DeepSeek-R1 situation gets extra confusing because:

- The distilled models are also provided by DeepSeek;

- There are also dynamic quants of (non-distilled) R1 - see [0]. Those, as I understand it, are more "real R1" than the distilled models, and you can get as low as a ~140GB file size with the 1.58-bit quant.

I actually managed to get the 1.58-bit dynamic quant running on my personal PC, with 32GB RAM, at about 0.11 tokens per second. That is, roughly six tokens per minute. That was with llama.cpp via LM Studio; using Vulkan for GPU offload (up to 4 layers for my RTX 4070 Ti with 12GB VRAM :/) actually slowed things down relative to running purely on the CPU, but either way, it's too slow to be useful with such specs.

--

[0] - https://unsloth.ai/blog/deepseekr1-dynamic

zozbot234

> it's too slow to be useful with such specs.

Only if you insist on realtime output: if you're OK with posting your question to the model and letting it run overnight (or, for some shorter questions, over your lunch break) it's great. I believe that this use case can fit local-AI especially well.

adastra22

I'm not sure that's fair, given that the distilled models are almost as good. Do you really think Deepseek's web interface is giving you access to 671b? They're going to be running distilled models there too.

zozbot234

Given that the 671B model is reportedly MoE-based, it definitely could be powering the web interface and API. MoE slashes the per-inference compute cost - and when serving the model for multiple users you only have to host a single copy of the model params in memory, so the bulk doesn't hurt you as much.
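
Rough numbers, assuming the commonly reported ~37B activated parameters per token for the 671B MoE (ballpark figures only, not a serving-cost model):

    total_params = 671e9   # total parameter count of the full R1 model
    active_params = 37e9   # reported activated parameters per token (MoE routing)

    # Per-token compute scales roughly with the activated parameters, so the MoE
    # does on the order of 1/18th of the work of a hypothetical dense 671B model,
    # even though the full 671B still has to be held in (possibly distributed) memory.
    print(f"active fraction: {active_params / total_params:.1%}")           # ~5.5%
    print(f"dense/MoE compute ratio: {total_params / active_params:.0f}x")  # ~18x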

blixt

LM Studio has been around for a long time and does a lot of similar things but with a more UI-based approach. I used to use it before Ollama, and seems it's still going strong. https://lmstudio.ai/

buyucu

Isn't LM Studio closed source?

7thpower

Can you please explain why you think they may be operating in bad faith?

diggan

Not parent, but same feeling.

First I got the feeling because of how they store things on disk and try to get all models rehosted in their own closed library.

The second time I got the feeling was when I realized it's not obvious at all what their motives are, and that it's a for-profit venture.

The third time was trying to discuss things in their Discord, where the moderators constantly shut down a lot of conversation citing "misinformation" and rewrite your messages. You can ask an honest question, it gets deleted, and you get blocked for a day.

Just today I asked why the R1 models they're shipping, which are the distilled ones, don't have "distilled" in the name, or even any way of knowing which tag is which model, and got the answer "if you don't like how things are done on Ollama, you can run your own object registry", which doesn't exactly inspire confidence.

Another thing I noticed after a while is that there are a bunch of people with zero knowledge of terminals who want to run Ollama, even though Ollama is a project for developers (since you do need to know how to use a terminal). Just making the messaging clearer would help a lot in this regard, but somehow the Ollama team thinks that's gatekeeping and that it's better to teach people basic terminal operations.

prabir

There is https://cortex.so/ that I'm looking forward to.

adastra22

Hey thanks, I didn't know about cortex and this looks perfect.

buyucu

I totally agree that Ollama needs competition. They have been doing very sketchy things lately. I wish llama.cpp had an alternative wrapper client like Ollama.

Liquix

Agreed. But what's wrong with Jan? Does Ollama utilize resources/run models more efficiently under the hood? (Sorry for the naivete.)

benxh

My biggest gripe with Ollama is the badly named models, e.g. under deepseek-r1, it defaults to the distill models.

buyucu

I agree they should rename them.

But defaulting to a 671b model is also evil.

rfoo

No. If you can't run it (and most people can never run the full model on their laptop), that's fine; let people know that fact instead of giving them an illusion.

Mashimo

Letting people download 400GB just to find that out is also... not optimal.

But yes, I have been "yelled" at on Reddit for telling people you need VRAM in the hundreds of GB.

singularity2001

At least the distilled models are officially provided by DeepSeek (?)

trash_cat

I use Ollama because I am a casual user and can't be bothered to read the docs on how to set up llama.cpp. I just want to run a simple LLM locally.

Why would I care about Vulkan?

buyucu

With Vulkan it runs much, much faster on consumer hardware, especially on iGPUs like Intel or AMD.

zozbot234

Well, it definitely runs faster on external dGPUs. With iGPUs and possibly future NPUs, the pre-processing/"thinking" phase is much faster (because that one is compute-bound), but text generation tends to be faster on the CPU because it makes better use of available memory bandwidth (which is the relevant constraint there). iGPUs and NPUs will still be a win wrt. energy use, however.
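
A crude way to see the bandwidth argument: each generated token has to stream roughly all of the (active) weights through memory once, so peak generation speed is bounded by bandwidth divided by model size. A sketch with made-up but plausible numbers:

    def peak_tokens_per_sec(model_bytes, mem_bandwidth_bytes_per_sec):
        # upper bound: every token reads ~all weights once; ignores KV cache, overlap, etc.
        return mem_bandwidth_bytes_per_sec / model_bytes

    model_bytes = 4.7e9  # e.g. a ~7B model quantized to ~4-5 bits per weight (ballpark)
    print(peak_tokens_per_sec(model_bytes, 50e9))   # ~10 tok/s on dual-channel DDR5-class RAM
    print(peak_tokens_per_sec(model_bytes, 500e9))  # ~100 tok/s on mid-range dGPU VRAM

Since an iGPU shares the same system RAM (and thus the same bandwidth) as the CPU, the bound is roughly the same either way, which is why offloading generation to it doesn't buy much speed.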

bdhcuidbebe

For Intel, OpenVINO should be the preferred route. I don't follow AMD, but Vulkan is just the common denominator here.

buyucu

If you support Vulkan, you support almost every GPU out there in the consumer market across all hardware vendors. It's an amazing fallback option.

I agree they should also support OpenVINO, but compared to Vulkan OpenVINO is a tiny market.

your_challenger

I don't know why one would use Ollama instead of llama.cpp. llama.cpp is so easy to use and the maintainer is pretty famous and active in the community.

buyucu

Llama.cpp dropped support for multimodal VLMs. That is why I am using Ollama. I would happily switch back if I could.

Gracana

The llama.cpp README still lists multimodal models... Qwen2-VL and others. Is that inaccurate, or something different?

[edit] Oh I see, here's an issue about it: https://github.com/ggerganov/llama.cpp/issues/8010

buyucu

It's a grey zone, but VLMs are effectively not being developed anymore.

mschwaig

Ollama tries to appeal to a lowest common denominator user base, who does not want to worry about stuff like configuration and quants, or which binary to download.

I think they want their project to be smart enough to just 'figure out what to do' on behalf of the user.

That appeals to a lot of people, but I think them stuffing all backends into one binary and auto-detecting at runtime which to use is actually a step too far towards simplicity.

What they did to support both CUDA and ROCm using the same binary looked quite cursed last time I checked (because they needed to link or invoke two different builds of llama.cpp of course).

I have only glanced at that PR, but I'm guessing that this plays a role in how many backends they can reasonably try to support.

In nixpkgs it's a huge pain: we configure quite deliberately what we want Ollama to do at build time, and then Ollama runs off and does whatever anyway, and users have to look at log output and performance regressions to know what it's actually doing, every time they update their heuristics for detecting ROCm. It's brittle as hell.
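
For illustration, the kind of runtime heuristic being described looks roughly like this (a purely hypothetical sketch, not Ollama's actual code; the checks and paths are invented):

    import os, shutil

    def pick_backend():
        # Hypothetical detection: prefer a vendor runtime if its tooling is visible,
        # otherwise fall back to CPU. Real detection is messier (driver versions,
        # VRAM checks, override env vars, ...), which is exactly what makes it brittle.
        if shutil.which("nvidia-smi"):
            return "cuda"
        if shutil.which("rocminfo") or os.path.exists("/opt/rocm"):
            return "rocm"
        return "cpu"

    print(f"would load the llama.cpp build for: {pick_backend()}")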

buyucu

I disagree with this, but it's a reasonable argument. The problem is that the Ollama team has basically ignored the PR instead of engaging with the community. The least they could do is explain their reasoning.

This PR is #1 on their repo based on multiple metrics (comments, iterations, what have you).

paradite

Could it be that supporting multiple platforms opens up more support tickets and adds more work to keep the software working on those new platforms?

As someone who has built apps for Windows, Linux, macOS, iOS and Android, I can say it is not trivial to ensure your new features or updates work on all platforms, and you have to deal with deprecations.

geerlingguy

They already support ROCm, which probably introduces 10x more support requests than Vulkan would!

buyucu

Ollama is not doing anything; llama.cpp does all that work. Ollama is just a small wrapper on top.

zozbot234

This is not quite correct. Ollama must assess the state of Vulkan support and amount of available memory, then pick the fraction of the model to be hosted on GPU. This is not totally foolproof and will likely always need manual adjustment in some cases.
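
To illustrate the kind of decision involved (a hypothetical sketch, not Ollama's actual logic), the core of it is dividing usable GPU memory by the per-layer size:

    def layers_to_offload(total_layers, layer_bytes, free_vram_bytes,
                          reserve_bytes=512 * 1024**2):
        # keep a safety margin for the KV cache and scratch buffers,
        # then offload as many whole layers as fit; the rest runs on the CPU
        usable = max(0, free_vram_bytes - reserve_bytes)
        return min(total_layers, usable // layer_bytes)

    # e.g. a 32-layer model with ~200 MB layers and a GPU reporting 6 GB free:
    print(layers_to_offload(32, 200 * 1024**2, 6 * 1024**3))  # -> 28

The numbers above are placeholders; the point is that detecting "free" VRAM and per-layer cost reliably across drivers is where the manual adjustment comes in.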

buyucu

The work involved is tiny compared to the work llama.cpp did to get Vulkan up and running.

This is not rocket science.

paradite

Ok, assuming what you said is correct, why wouldn't Ollama then be able to support Vulkan by default, out of the box?

Sorry, I'm not sure what exactly the relationship is between the two projects. This is a genuine question, not a troll question.

buyucu

Check the PR; it's a very short one. It's not more complicated than setting a compile-time flag.

I have no idea why they have been ignoring it.

Ollama is just a friendly front end for llama.cpp. It doesn't have to do any of those things you mentioned; llama.cpp does all that.