
Kimi K2 1T model runs on 2 512GB M3 Ultras

A_D_E_P_T

Kimi K2 is a really weird model, just in general.

It's not nearly as smart as Opus 4.5 or 5.2-Pro or whatever, but it has a very distinct writing style and also a much more direct "interpersonal" style. As a writer of very-short-form stuff like emails, it's probably the best model available right now. As a chatbot, it's the only one that seems to really relish calling you out on mistakes or nonsense, and it doesn't hesitate to be blunt with you.

I get the feeling that it was trained very differently from the other models, which makes it situationally useful even if it's not very good for data analysis or working through complex questions. For instance, as it's both a good prose stylist and very direct/blunt, it's an extremely good editor.

I like it enough that I actually pay for a Kimi subscription.

wasting_time

It's also the only model that consistently nails my favorite AI benchmark: https://clocks.brianmoore.com/

tootie

I use that one for image gen too. Ask for a picture of a grandfather clock at a specific time. Most models are completely unable to do it. Clocks are always at 10:10 because that's the most photogenic time, used in most stock photos.

amelius

But how sure are we that it wasn't trained on that specifically?

jug

And given this, it unsurprisingly scores very well on https://eqbench.com

stingraycharles

> As a chatbot, it's the only one that seems to really relish calling you out on mistakes or nonsense, and it doesn't hesitate to be blunt with you.

My experience is that Sonnet 4.5 does this a lot as well, but more often than not it's due to a lack of full context: accusing the user of not doing X or Y when it simply wasn't told that had already been done, and then apologizing.

How is Kimi K2 in this regard?

Isn’t “instruction following” the most important thing you’d want out of a model in general, and a model pushing back more likely than not being wrong?

Kim_Bruning

> Isn’t “instruction following” the most important thing you’d want out of a model in general,

No. And for the same reason that pure "instruction following" in humans is considered a form of protest/sabotage.

https://en.wikipedia.org/wiki/Work-to-rule

SkyeCA

It's insanity to me that doing your job exactly as defined and not giving away extra work is considered a form of industrial action.

Everyone should be working-to-rule all the time.

stingraycharles

I don’t understand the point you’re trying to make. LLMs are not humans.

From my perspective, the whole problem with LLMs (at least for writing code) is that they assume things. They shouldn't assume anything; they should follow instructions faithfully and ask the user for clarification if there is ambiguity in the request.

I find it extremely annoying when the model pushes back / disagrees, instead of asking for clarification. For this reason, I’m not a big fan of Sonnet 4.5.

logicprog

How do you feel K2 Thinking compares to Opus 4.5 and 5.2-Pro?

jug

? The user directly addresses this.

Kim_Bruning

Speaking of weird. I feel like Kimi is a shoggoth with its tentacles in a man-bun. If that makes any sense.

mehdibl

Claims like these are, as always, misleading, because they don't show the context length or the prefill speed when you use a lot of context. It will be fun waiting minutes for a reply.

Kim_Bruning

Kimi K2 is a very impressive model! It's particularly un-obsequious, which makes it useful for actually checking your reasoning on things.

Some ChatGPT models, especially older ones, will tell you that everything you say is fantastic and great. Kimi, on the other hand, doesn't mind taking a detour to question your intelligence and likely your entire ancestry if you ask it to be brutal.

diydsp

Upon request, it roasts you. Good for reducing distractions.

websiteapi

I get tempted to buy a couple of these, but I just feel like the amortization doesn’t make sense yet. Surely in the next few years this will be orders of magnitude cheaper.

NitpickLawyer

Before committing to purchasing two of these, you should look at the true speeds, which few people post; not just "it works". We're at a point where we can run these very large models "at home", and that is great! But real usage now involves very large contexts, both in prompt processing and in token generation. Whatever speed these models reach at "0" context is very different from what they get at "useful" context, especially for coding and the like.
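To illustrate why prefill speed matters so much more than the headline tokens-per-second figure, here is a back-of-the-envelope sketch. All numbers are hypothetical assumptions for illustration, not measured speeds for this setup:

```python
# Hypothetical sketch: at long context, time-to-first-token is dominated
# by prefill (prompt processing), not by the generation speed people quote.
def time_to_first_token(context_tokens: int, prefill_tps: float) -> float:
    """Seconds spent processing the prompt before any output appears."""
    return context_tokens / prefill_tps

# Assume (hypothetically) 60 tokens/s prefill and a 50k-token coding context:
minutes = time_to_first_token(50_000, 60) / 60
print(round(minutes, 1))  # -> 13.9 minutes before the first output token
```

The exact prefill rate on an M3 Ultra pair would need to be benchmarked; the point is only that it divides into the context length, so a "useful" coding context can mean waiting many minutes before the first token appears.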

stingraycharles

I don’t think it will ever make sense; you can buy so much cloud based usage for this type of price.

From my perspective, the biggest problem is that I am just not going to be using it 24/7. Which means I’m not getting nearly as much value out of it as the cloud based vendors do from their hardware.

Last but not least, if I want to run queries against open source models, I prefer to use a provider like Groq or Cerebras as it’s extremely convenient to have the query results nearly instantly.
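As a rough illustration of the amortization argument, here is a break-even sketch. Every number below is an assumption made up for illustration (hardware price, per-token cloud price, and daily usage), not a real quote:

```python
# Hypothetical break-even sketch: local hardware vs. pay-per-token cloud.
# All figures are assumptions for illustration, not real prices.
hardware_cost = 2 * 9_500         # assumed USD for two 512GB M3 Ultras
cloud_price_per_mtok = 2.50       # assumed blended $/1M tokens, open-weights model
tokens_per_day = 500_000          # assumed heavy personal usage

daily_cloud_cost = tokens_per_day / 1_000_000 * cloud_price_per_mtok
breakeven_days = hardware_cost / daily_cloud_cost
print(round(breakeven_days))      # -> 15200 days (~42 years) to break even
```

Plug in your own numbers; unless utilization is very high or cloud prices are much steeper, the break-even horizon tends to be long, which is the commenter's point about not using the hardware 24/7.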


lordswork

As long as you're willing to wait up to an hour for your GPU to get scheduled when you do want to use it.

stingraycharles

I don’t understand what you’re saying. What’s preventing you from using eg OpenRouter to run a query against Kimi-K2 from whatever provider?

websiteapi

my issue is once you have it in your workflow I'd be pretty latency sensitive. imagine those record-it-all apps working well. eventually you'd become pretty reliant on it. I don't want to necessarily be at the whims of the cloud

givinguflac

I think you’re missing the whole point, which is not using cloud compute.

stingraycharles

Because of privacy reasons? Yeah I’m not going to spend a small fortune for that to be able to use these types of models.

chrsw

The only reason to run local models is privacy; never cost, or even latency.

websiteapi

indeed - my main use case is those kind of "record everything" sort of setups. I'm not even super privacy conscious per se but it just feels too weird to send literally everything I'm saying all of the time to the cloud.

luckily, for now Whisper doesn't require too much compute, but the kind of interesting analysis I'd want would require at least a 1B-parameter model, maybe 100B or 1T.

andy99

Autonomy generally, not just privacy. You never know what the future will bring, AI will be enshittified and so will hubs like huggingface. It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

Aurornis

> You never know what the future will bring, AI will be enshittified and so will hubs like huggingface.

If anyone wants to bet that future cloud hosted AI models will get worse than they are now, I will take the opposite side of that bet.

> It’s useful to have an off grid solution that isn’t subject to VCs wanting to see their capital returned.

You can pay cloud providers for access to the same models that you can run locally, though. You don't need a local setup even for this unlikely future scenario where all of the mainstream LLM providers simultaneously decide to make their LLMs poor quality and none of them sees this as a market opportunity to provide good service.

But even if we ignore all of that and assume that all of the cloud inference everywhere becomes bad at the same time at some point in the future, you would still be better off buying your own inference hardware at that point in time. Spending the money to buy two M3 Ultras right now to prepare for an unlikely future event is illogical.

The only reason to run local LLMs is if you have privacy requirements or you want to do it as a hobby.

chrsw

Yes, I agree. And you can add security to that too.

amelius

Maybe wait until RAM prices have normalized again.

alwillis

Hopefully the next time it’s updated, it should ship with some variant of the M5.

Alifatisk

You should mention that it's a 4-bit quant. Still very impressive!

geerlingguy

Kimi K2 was made to be optimized at 4-bit, though.

natrys

That's Kimi K2 Thinking; this post seems to be talking about the original Kimi K2 Instruct, and I don't think an INT4 QAT (quantization-aware training) version was released for it.
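The 4-bit point also explains the hardware in the headline. A quick weights-only footprint sketch (ignoring KV cache and activations, which add more on top):

```python
# Rough memory footprint for a 1-trillion-parameter model at different
# weight precisions. Weights only; KV cache and activations are extra.
def weights_gb(params: float, bits: int) -> float:
    return params * bits / 8 / 1e9  # bits -> bytes -> GB

params = 1e12  # ~1T parameters
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weights_gb(params, bits):.0f} GB")
# -> 16-bit: 2000 GB, 8-bit: 1000 GB, 4-bit: 500 GB
```

At 4 bits the weights alone are about 500 GB, which is why a 1T model only becomes feasible across two 512GB machines once it is quantized that far down.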