
Phi 4 available on Ollama


39 comments · January 9, 2025

sgk284

Over the holidays, we published a post[1] on using high-precision few-shot examples to get `gpt-4o-mini` to perform similarly to `gpt-4o`. I just re-ran that same experiment, but swapped out `gpt-4o-mini` for `phi-4`.

`phi-4` really blew me away in terms of learning from few-shots. It measured as being 97% consistent with `gpt-4o` when using high-precision few-shots! Without the few-shots, it was only 37%. That's a huge improvement!

By contrast, with few-shots it performs as well as `gpt-4o-mini` (though `gpt-4o-mini`'s baseline without few-shots was 59% – quite a bit higher than `phi-4`'s).

[1] https://bits.logic.inc/p/getting-gpt-4o-mini-to-perform-like
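
For anyone curious what this looks like mechanically, here's a minimal sketch of the few-shot setup, assuming an OpenAI-compatible endpoint (Ollama exposes one locally); the classification task, labels, and model tag are placeholders rather than the actual data from the post:

```python
# Minimal few-shot sketch: prepend curated (input, label) pairs as prior turns,
# then ask the small model to label a new input the same way.
from openai import OpenAI

# Assumption: an OpenAI-compatible server (Ollama exposes one at this port) hosting Phi-4.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

# High-precision few-shot examples (placeholders, not the post's actual data).
few_shots = [
    ("The package arrived crushed and two items were missing.", "complaint"),
    ("Can I change the shipping address on order #1234?", "question"),
]

messages = [{"role": "system",
             "content": "Classify each message as 'complaint' or 'question'. Reply with one word."}]
for text, label in few_shots:
    messages.append({"role": "user", "content": text})
    messages.append({"role": "assistant", "content": label})
messages.append({"role": "user", "content": "Why was I charged twice this month?"})

resp = client.chat.completions.create(
    model="phi4",  # tag as pulled from the Ollama library; adjust if yours differs
    messages=messages,
    temperature=0,
)
print(resp.choices[0].message.content)
```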

crorella

It's odd that MS is releasing models that compete with OpenAI's. This reinforces the idea that there is no real strategic advantage in owning a model. I think the strategy now is to offer cheap, performant infra to run the models.

PittleyDunkin

> It's odd that MS is releasing models that compete with OpenAI's.

> I think the strategy now is to offer cheap, performant infra to run the models.

Is this not what Microsoft is doing? What can Microsoft possibly lose by releasing a model?

lolinder

That's exactly what they're saying: it's interesting that Microsoft came to the same conclusion that Meta did, that models are generally not worth keeping locked down. It suggests that OpenAI has a very fragile business model, given that they're wholly dependent on large providers for the infra, which is apparently the valuable part of the equation.

krick

To be fair, OpenAI's products are not really models, they are... products. So it's debatable whether they really have anything special.

I don't really think they do, because to me it has seemed pretty obvious since GPT-1 that having callbacks to run Python and query Google, having an "inner dialog" before summarizing an answer, and a dozen more simple improvements like these were obvious things to do that nobody had actually implemented (yet). And if some of them are not obvious per se, they are pretty obvious in hindsight. But, yeah, it's debatable.

I must admit, though, that I doubt this weakness isn't just as obvious to the stakeholders. I have no idea what the plan is; maybe what they're going to have that Anthropic doesn't is a nuclear reactor. Honestly, we're all pretending to be forward-thinking analysts here, but in reality I couldn't figure out, at the time it was happening, that Musk's "investment" in Twitter was literally politics. Even though I was sure there was some plan, I couldn't say what it was, and I don't remember anybody in these threads expressing clearly what is quite obvious in hindsight. Neither did people like Matt Levine, who are actually paid for their shitposting: I mostly remember them making fun of Musk "doing stupid stuff and finding out" and calling it a "toy".

sumedh

> It suggests that OpenAI has a very fragile business model

That is why they are building products: so that people stay on the platform.

m3kw9

They are releasing non-SOTA models.

easton

I think they want/need a plan b in case OpenAI falls apart like it almost did when Sam got fired.

hbcondo714

FWIW, Phi-4 was converted to Ollama by the community last month:

https://ollama.com/vanilj/Phi-4

smallerize

And adopted unsloth's bug fixes a few days ago. https://ollama.com/vanilj/phi-4-unsloth

mythz

I was disappointed in all the Phi models before this, which benchmarked way better than they worked in practice, but I've been really impressed with how good Phi-4 is at just 14B. We ran it against the top 1000 most popular StackOverflow questions and it came in 3rd, beating out GPT-4 and Sonnet 3.5 in our benchmarks, behind only DeepSeek v3 and WizardLM 8x22B [1]. We're using Mixtral 8x7B to grade the quality of the answers, which could explain how WizardLM (based on Mixtral 8x22B) took 2nd place.

Unfortunately I'm only getting 6 tok/s on an Nvidia A4000, so it's still not great for real-time queries, but luckily, now that it's MIT licensed, it's available on OpenRouter [2] at a great price of $0.07/$0.14 per million input/output tokens, at a fast 78 tok/s.

Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3].

[1] https://pvq.app/leaderboard

[2] https://openrouter.ai/microsoft/phi-4

[3] https://pvq.app/questions/ask
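
If it helps anyone, hitting it on OpenRouter is just the standard OpenAI-compatible request shape; a minimal sketch (the model slug comes from [2], the API key and question are placeholders):

```python
# Hedged sketch: querying Phi-4 via OpenRouter's OpenAI-compatible endpoint.
import requests

resp = requests.post(
    "https://openrouter.ai/api/v1/chat/completions",
    headers={"Authorization": "Bearer <OPENROUTER_API_KEY>"},  # placeholder key
    json={
        "model": "microsoft/phi-4",  # slug from [2]
        "messages": [{"role": "user", "content": "How do I reverse a list in Python?"}],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```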

KTibow

Interesting eval, but my first reaction is "using Mixtral as a judge doesn't sound like a good idea". Have you tested how different its results are from GPT-4's as a judge (on a small scale), or how things like style and ordering can affect its judgements?

Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper

mythz

Yeah, we evaluated several models for grading ~1 year ago and concluded Mixtral was the best choice for us, as it was the best-performing model that we could self-host while distributing the load of grading 1.2M+ answers over several GPU servers.

We would have liked to pick a neutral model like Gemini, which was fast, reliable and low cost; unfortunately it gave good grades to too many poor answers [1]. If we had to pick a new grading model now, hopefully the much-improved Gemini Flash 2.0 would yield better results.

[1] https://pvq.app/posts/individual-voting-comparison#gemini-pr...
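
For anyone unfamiliar with the setup, LLM-as-judge grading is roughly the sketch below; the rubric, 1-10 scale, and prompt wording here are illustrative assumptions, not the actual grader prompt we use:

```python
# Hedged sketch of LLM-as-judge grading: ask a grader model to score an answer,
# then parse the numeric score out of its reply.
import re

# Illustrative grading prompt; the real rubric and scale may differ.
GRADER_PROMPT = """You are grading answers to programming questions.

Question:
{question}

Candidate answer:
{answer}

Rate the answer's correctness and helpfulness from 1 to 10.
Reply with just the number."""

def parse_score(reply: str) -> int | None:
    """Pull the first integer out of the grader model's reply, if any."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None

# The filled-in GRADER_PROMPT would be sent to the self-hosted grader model
# (Mixtral 8x7B in our case); parse_score() turns its reply into a number.
print(parse_score("I'd give this a 7 out of 10."))  # -> 7
```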

KTibow

There are a lot of interesting options. Gemini 2 Flash isn't ready yet (the current limits are 10 RPM and 1500 RPD) but it could definitely work. An alternative might be using a fine tuned model - I've heard good things about OpenAI fine tuning with even a few examples.

lolinder

Honestly, the fact that you used an LLM to grade the answers at all is enough to make me discount your results entirely. That it showed an obvious preference for the model with which it shares weights is just a symptom of the core problem, which is that you had to pick a model to trust before you even ran the benchmarks.

The only judges that matter at this stage are humans. Maybe someday when we have models that humans agree are reliably good you could use them to judge lesser-but-cheaper models.

lhl

I tested Phi-4 with a Japanese functional test suite and it scored much better than prior Phis (and comparably to much larger models, basically in the top tier atm). [1]

The one red flag with Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc.) [2], so it's one area especially worth keeping an eye on for those testing Phi-4 themselves...

[1] https://docs.google.com/spreadsheets/u/3/d/18n--cIaVt49kOh-G...

[2] https://github.com/google-research/google-research/blob/mast...
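
To make that concrete, IFEval constraints are mechanically checkable instructions; the two checks below are my own rough approximations of the constraint types listed in [2], not the actual IFEval code:

```python
# Rough approximations of two IFEval constraint types; not the real implementation.

def avoids_forbidden_words(response: str, forbidden: list[str]) -> bool:
    """Constraint like: "do not use the words X, Y, Z anywhere in your answer"."""
    lowered = response.lower()
    return not any(word.lower() in lowered for word in forbidden)

def is_all_caps(response: str) -> bool:
    """Constraint like: "your entire response must be in capital letters"."""
    return response == response.upper()

response = "SURE, HERE IS THE SUMMARY YOU ASKED FOR."
print(avoids_forbidden_words(response, ["maybe", "perhaps"]))  # True
print(is_all_caps(response))                                   # True
```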

andhuman

I’ve seen on the localllama subreddit that some GGUFs have bugs in them. The one recommended was by unsloth. However, I don’t know how the Ollama GGUF holds up.

compumetrika

Ollama can pull directly from HF: you just provide the URL and append `:Q8_0` (or whatever) to specify your desired quant. Bonus: use the short-form URL `hf.co` instead of `huggingface.co` to shorten the model name a little in the `ollama list` table.

Edit: so for example, if you want the unsloth "debugged" version of Phi-4, you would run:

`$ ollama pull hf.co/unsloth/phi-4-GGUF:Q8_0`

(check on the right side of the hf.co/unsloth/phi-4-GGUF page for the available quants)

jmorgan

Phi-4's architecture changed slightly from Phi-3.5 (it no longer uses a sliding window of 2,048 tokens [1]), causing a change in the hyperparameters (and ultimately an error at inference time for some published GGUF files on Hugging Face, since the same architecture name/identifier was re-used between the two models).

For the Phi-4 uploaded to Ollama, the hyperparameters were set to avoid the error. The error should stop occurring in the next version of Ollama [2] for imported GGUF files as well.

In retrospect, a new architecture name should probably have been used entirely, instead of re-using "phi3".

[1] https://arxiv.org/html/2412.08905v1

[2] https://github.com/ollama/ollama/releases/tag/v0.5.5
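
If you want to see what a particular GGUF file actually declares before loading it, something like the sketch below works, assuming the `gguf` Python package that ships with the llama.cpp project; the file path is a placeholder and the exact key names (e.g. a sliding-window field under the `phi3` architecture) are illustrative assumptions:

```python
# Sketch: list the metadata a GGUF file declares, to spot architecture/attention
# settings (e.g. sliding-window fields) before trying to load it.
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("phi-4-Q8_0.gguf")  # placeholder path

for name in reader.fields:
    # Only print architecture- and attention-related keys to keep the output short.
    if name == "general.architecture" or ".attention." in name:
        print(name)
```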

magicalhippo

Here's [1] a recent submission on that.

[1]: https://news.ycombinator.com/item?id=42660335 Phi-4 Bug Fixes

gnabgib

Related: "Phi-4: Microsoft's Newest Small Language Model Specializing in Complex Reasoning" (439 points, 24 days ago, 144 comments) https://news.ycombinator.com/item?id=42405323

Also on hugging face https://huggingface.co/microsoft/phi-4


behnamoh

Aren't we as a community giving too much credit and power to one app (aka Ollama), which so clearly reaps the benefits of other open source projects like llama.cpp?

DrPhish

It's very hard to put into words without coming off as unfair to one side or the other, but the ollama project really does provide little to no _innovative_ value over simply running components of llama.cpp directly from the command line. 100% of the heavy lifting (from an LLM perspective) is in the llama.cpp codebase. The ollama parts are all simple, well-understood, commodity components that almost any developer could have produced.

Now, applications like ollama obviously need to exist, as not everyone can run CLI utilities, let alone clone a git repo and compile it themselves. Easy-to-use GUIs are essential for the adoption of new tech (much like the many apps that wrap ffmpeg and are mostly UI).

However, if ollama is mostly doing commodity GUI things on top of a fully fleshed-out, _unique_ codebase to which its very existence is owed, they should do everything in their power to point that out. I'm sure they're legally within their rights because of the licensing; I'm speaking purely from an ethical perspective.

I think there is a lot of ill-will towards ollama in some hard-core OG LLM communities because ollama appears to be attempting to capture the value that ggerganov has provided to the world in this tool without adequate attribution (although there is a small footnote, iirc). Basically, the debt that ollama owes to llama.cpp is so immense that they need to do a much better job recognizing it imo.

noodletheworld

The ollama application itself has zero value; it's just an easy-to-use front end to their model hosting, which is both what this submission is and why they're important.

Only having one model host (hugging face) is bad for obvious reasons (and good in others, yes, but still)

Ollama offering an alternative as a model host seems quite reasonable and quite well implemented.

The front end really is nothing; it's just llama.cpp in a Go wrapper. It has no value and it's not really interesting; it's simple, stable technology that is perfectly fine to rely on while being totally unexcited about and uninterested in, technically.

…but they do a lot more than that, and I think it's a little unfair to imply that this trivial piece of their stack is all they do.

mythz

The software that controls the front end has enormous value: it becomes the central point and brand for managing and self-hosting LLMs, and the 100 GB catalog of models it manages acts like a moat inhibiting switching to alternatives. Awareness and user base are the hardest things to obtain with new software products, and it has both. Right now it doesn't look like it's monetizing its user base, but it could easily attract millions in VC funding to spin off a company selling support contracts and "higher value" SaaS hosting or enterprise management features.

Whilst it's now a UX friendly front-end for llama.cpp, it's also working on adding support for other backends like MLX [1].

[1] https://github.com/ollama/ollama/issues/1730

The_Amp_Walrus

I might be wrong about this, but doesn't ollama do some work to ensure the model runs efficiently given your hardware? Like choosing how much GPU memory to consume so you don't OOM. Does llama.cpp do that for you with zero config?
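
For context, with llama.cpp (at least via the llama-cpp-python bindings) I believe you typically pick the GPU offload yourself, whereas Ollama tries to estimate how many layers fit in VRAM and splits the rest to the CPU. A minimal sketch, with the model path and layer count as placeholders:

```python
# Sketch: with llama.cpp bindings you choose the GPU offload yourself; too many
# layers can OOM the GPU, too few leaves VRAM idle.
from llama_cpp import Llama  # pip install llama-cpp-python

llm = Llama(
    model_path="phi-4-Q8_0.gguf",  # placeholder path
    n_gpu_layers=35,               # manual choice: how many layers to offload to the GPU
    n_ctx=4096,                    # context window to allocate
)
print(llm("Q: What is 2 + 2?\nA:", max_tokens=8)["choices"][0]["text"])
```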

porcoda

I’m not seeing what the issue is with ollama. Can you elaborate? There are tons of open source projects that other stuff gets built upon: that’s part of the point of open source.

digdugdirk

I thought ollama was just a convenience wrapper around llama.cpp?

magicalhippo

That might be how it started, but there are differences. For example, support for Llama 3.2 Vision was added to Ollama[1], but not upstreamed[2] to llama.cpp due to image-processing requirements AFAIK.

[1]: https://github.com/ollama/ollama/releases/tag/v0.4.0

[2]: https://github.com/ggerganov/llama.cpp/issues/9643

chamomeal

Looks like you’re being downvoted. It’d be nice if somebody could explain the difference, cause I’m also kinda out of the loop on this