Grok 4 Launch [video]

205 comments · July 10, 2025

zone411

Grok 4 sets a new high score on my Extended NYT Connections benchmark (92.4), beating o3-pro (87.3): https://github.com/lechmazur/nyt-connections/.

Grok 4 Heavy is not in the API.

sebzim4500

Very impressive, but what do you think the chances are that this was in the training data?

diggan

> but what do you think the chances are that this was in the training data?

Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems and solutions from NYT Connections is in any of the other datasets.

SilverSlash

The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have.

I can already use Gemini 2.5 Pro for free in AI studio. Crazier still, I can even set the thinking budget to a whopping 32k and still not pay a dime. Maybe Gemini 3.0 will be available for free as well.

brookst

Who promised that there would be no advanced models with high costs?

Prices for the same number of tokens at the same level of capability are falling. But it's just like Moore's law, which most certainly did NOT say that chips would get no more complex than the 1103 1Kb DRAM, only that a chip of a given complexity would shrink from 10mm^2 to a speck far too small to see.

worldsavior

Why is the number of GPUs the problem and not the cost of GPU usage? I don't think buying GPUs is the problem; running tons of GPUs is what gets very expensive. I presume that's the reason it's so expensive, especially with LLMs.

altbdoor

It's important to note that pricing for Gemini has been increasing too.

https://news.ycombinator.com/item?id=44457371

Havoc

It’s the inference-time scaling - this is going to create a whole new level of haves-vs-have-nots split.

The vast majority of the world can’t afford hundreds of dollars a month.

pzo

Also, their API pricing is a little misleading - it matches Sonnet 4 pricing ($3/$15) only "for requests under 128k" (whatever that means); above that it's 2x more.

vessenes

That 128k is a reference to the context window — how many tokens you put in at the start. Presumably Grok 4 with a 128k context window is running on less hardware (it needs much less RAM than 256k), and they route requests accordingly internally.
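
If that's right, the tier boundary matters a lot for cost. Here's a back-of-envelope sketch in Python (the $3/$15-per-million input/output rates and the 2x-above-128k rule come from the comment above; the exact tier mechanics are my assumption):

    def request_cost(input_tokens: int, output_tokens: int) -> float:
        # Assumed rule: both rates double once the prompt exceeds 128k tokens.
        multiplier = 2 if input_tokens > 128_000 else 1
        return multiplier * (input_tokens * 3 + output_tokens * 15) / 1_000_000

    print(request_cost(100_000, 2_000))  # 0.33 -> $0.33 in the base tier
    print(request_cost(200_000, 2_000))  # 1.26 -> $1.26 once the 2x tier applies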

42lux

It's because a lot of the advancements are in post-training; the models themselves have stagnated. Look at the Heavy "model"...

ignoramous

> Gemini 2.5 Pro for free ...

It is Google. So I'd pay attention to data collection feeding back into training or evaluation.

https://news.ycombinator.com/item?id=44379036

ljlolel

More of an issue of market share than # of GPUs?

modeless

Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.

Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.

vessenes

Agreed. I noticed a quick flyby of a bad “reasoning smell” in the baseball World Series simulation, though - it looks like it pulled some numbers from Polymarket, reasoned a long time, and then came back with the Polymarket number for the Dodgers but presented it as its own. It was a really fast run-through, so I may be wrong, but it reminds me that it’s useful to have skeptics on the safety teams of these frontier models.

That said, these are HUGE improvements. Provided we don’t have benchmark contamination, this should be a very popular daily driver.

On coding - the 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.

esafak

I wish the coding models were available in coding agents. Haven't seen them anywhere.

vincent_s

Grok 4 is now available in Cursor.

Workaccount2

Well we have GPT-5 and Gemini 3 in the wings so it wouldn't be surprising if it is SOTA for a few days.

monkeydust

Yup, this will probably trigger the next wave of releases; someone had to go first.

tibbar

The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I'm genuinely looking forward to trying this out.

EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out, but it seems like xAI really has something here.
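
For intuition, here is a minimal sketch of the parallel-agents-then-compare idea (this is not xAI's actual implementation; the base URL and model id are assumptions, using the standard OpenAI-compatible Python client):

    from concurrent.futures import ThreadPoolExecutor
    from openai import OpenAI

    # Assumed OpenAI-compatible endpoint and model id; swap in real values.
    client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_KEY")

    def ask(prompt: str) -> str:
        resp = client.chat.completions.create(
            model="grok-4",  # assumed model id
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def heavy(prompt: str, n: int = 4) -> str:
        # Run n independent attempts in parallel...
        with ThreadPoolExecutor(max_workers=n) as pool:
            drafts = list(pool.map(ask, [prompt] * n))
        # ...then have one final call compare the drafts and pick/merge an answer.
        numbered = "\n\n".join(f"Draft {i+1}:\n{d}" for i, d in enumerate(drafts))
        return ask(f"Here are {n} independent answers to the same question:\n\n"
                   f"{numbered}\n\nCompare them and give the single best answer.")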

icoder

I can understand how/that this works, but it still feels like a 'hack' to me. The LLMs themselves seem to be plateauing, while the applications get better by running the LLMs deeper, longer, wider (and by adding 'non-AI' tooling/logic at the edges).

But maybe that's simply the solution, just as the solution to the original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.

cfn

Maybe this is the dawn of the multicore era for LLMs.

Voloskaya

> Expensive and slow

Yes, but... in order to train your next SotA model you have to do this anyway and do rejection sampling to generate good synthetic data.

So if you can do it in prod for users paying $300/month, it's a pretty good deal.
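
In sketch form, that loop is simple (a minimal Python sketch; generate and verify are hypothetical stand-ins for the sampler and the answer checker):

    def rejection_sample(prompt: str, generate, verify, k: int = 16) -> list[str]:
        # Sample k candidate solutions, keep only those the verifier accepts;
        # the survivors become synthetic training data for the next model.
        candidates = [generate(prompt) for _ in range(k)]
        return [c for c in candidates if verify(prompt, c)]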

daniel_iversen

Very clever, thanks for mentioning this!

simianwords

That's how o3-pro also works, IMO.

bobjordan

I can’t help but call out that o1-pro was great. It rarely took more than five minutes, and I was almost never dissatisfied with the results given the wait. I happily paid for o1-pro the entire time it was available. Now, o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions and gaslight people about files being available for download that don’t exist, or to provide simplified answers after a 20-minute wait. It’s worse than useless when it actively wastes users’ time.

I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. To go from a great model to taking it away and forcing an objectively terrible replacement is definitely going the wrong way, when everyone else is improving (Gemini 2.5, Claude Code with Opus, etc). I can’t believe Meta would pay a premium to poach the OpenAI people responsible for this severe regression.

tibbar

Interesting. I'd guess this technique should probably work with any SOTA model in an agentic tool loop. Fun!

zone411

That's the speculation, but then it shouldn't take much longer than o3 to answer.

sidibe

You are making the mistake of taking one of Elon's presentations at face value.

tibbar

I mean, either they cheated on evals à la Llama 4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.

rpozarickij

Grok's updated voice mode is indeed impressive. I wish there were a way to disable automatic turn detection, so that it wouldn't treat silence as the end of a response. I like Claude's approach (you need to tap in order to end your turn), but it's not very reliable, because sometimes it just abruptly cuts off my response without waiting for the tap.

I was pleasantly surprised that Grok even supports (to some degree) Lithuanian in voice mode, which is quite a niche language. Grok's responses themselves are alright, but ChatGPT and Gemini far surpass it in speech recognition and speech synthesis.

pzo

Yes, their voice mode is pretty good; it also works with Polish (much better than a few months ago). I wish they also had a 'push to talk' option (walkie-talkie style, with a big button), like Perplexity offers alongside its 'automatic' mode.

Also, it would be great if they added voice mode in the browser (again, like Perplexity).

rpozarickij

> Also would be great if they added voice mode in browser

There seems to be a voice mode button in the prompt input box at ~29:00 of the Grok 4 announcement video. So perhaps they're working on this, but it's hidden from the public.

z7

"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%."

"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."

https://x.com/arcprize/status/1943168950763950555

lexandstuff

Out of interest, has anyone ever integrated with Grok? I've done so many LLM integrations in the last few years, but never heard of anyone choosing Grok. I feel like they are going to need an unmistakably capable model before anyone would want to risk it - they don't behave like a serious company.

47thpresident

Grok 3 is on Azure AI Foundry [0] and announced an integration with Telegram, albeit they are paying Telegram $300m, not vice versa [1]. But I agree: choosing Grok is just a huge reputational liability for anyone doing serious work.

[0] https://devblogs.microsoft.com/foundry/announcing-grok-3-and... [1] https://www.bbc.co.uk/news/articles/cdxvr3n7wlxo

wongarsu

There have been at least two instances of "unauthorized modifications" to the system prompt of the Grok model running wild on X, but if you build your own integration you provide your own system prompt and are unaffected by that.

On the model side, I've found Grok 3 to be very unbiased. If you ask it to write a story it will somehow find a way to weave a mention of X/Twitter into that story, but other than that it is much less biased and moralizing than e.g. OpenAI models. It also has very lax guard rails, so that's something you'd probably want to add.

I can't say yet whether all of this is still true for Grok 4.

hhh

Are you asking it to write a story on, like, grok.com or inside of Twitter? Or are you saying that if I call the API and ask for a story, I'm going to get Twitter woven in there somehow?

raspasov

Grok has consistently been one of the best models I've used for deep research (no API use). Grok 4 looks even more promising.

spaceman_2020

Grok's Twitter integration has legitimately been one of the best use cases I've seen. Just being able to ask Grok right within the tweet about context or meaning of any jargon is very useful.

LorenDB

I think the Grok button that is present on tweets is the best way to ask Grok about tweets. Tagging @grok just spams others' timelines with useless AI responses. The Grok button lets you keep it private.

saagarjha

@grok is this true?

archagon

Particularly useful if you’re an antisemite or white supremacist, it seems.

sebzim4500

Until very recently, it was alt-right people getting frustrated that they couldn't get grok to confirm their delusions. They had tricks to get it to confirm their priors (esp. asking leading questions and demanding a single word response) but they didn't work that well.

moralestapia

While you're not wrong, I feel like they don't make up a significant chunk of @grok's queries. People usually talk about other topics.

FirmwareBurner

> deep research

Can you say what you mean by deep research?

repsak

Agent that browses the web, analyzes information, and creates reports. Grok calls it DeepSearch. Similar to gemini/openai deep research.

https://x.ai/news/grok-3#grok-agents-combining-reasoning-and...

looyd

Has anyone tried it for coding?

pmdr

Metrics aside, Grok model names make more sense than OpenAI's. I've really lost track of which OpenAI model is better, and in which way.

lupusreal

OpenAI names models like people name word documents. Report-1, Report-2, Report-2a, Report-final, Report-final-final, Report-actually-final, Report-2a-final...

brookst

OpenAI has leapfrogged that kind of naming. If they did word docs they would be Report-2, Report-a2; Report2-a, Reporta-2.

TheAceOfHearts

Does anyone here have access to Grok 4 yet? If so, could you please try asking it to solve this basic word search problem [0] and share the results? It's just a simple grid of letters where you have to find the position of each word, the kind of problem that any young child can easily solve.

[0] https://imgur.com/VxNP5jG

modeless

They said they're training a new base model for better multimodal performance soon. I wouldn't expect it to be able to read an image like that today. Maybe if you provided it in text format.

Szpadel

Description from OpenRouter:

> Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not exposed, reasoning cannot be disabled, and the reasoning effort cannot be specified.

Unfortunately, no requests are getting through because of some rate limits.

TheAceOfHearts

As a point of interest and for comparison, Gemini 2.5 Pro is able to generate a Python program that outputs the complete correct solution when run, but it can't figure out how to one-shot the problem if asked directly.

This is just a for-fun test to get a sense of how models are progressing; it highlights the jagged nature of their intelligence and capabilities. None of the big AI labs are testing for such a basic problem type, which makes it a bit of an interesting check.

I think it's still interesting to see how Grok 4 performs, even if we don't use this test to draw any broader conclusions about what capabilities it offers.
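
For reference, a solver along the lines of what Gemini generated only takes a few lines (a minimal sketch; this tiny grid and these words are made up for illustration, not the linked puzzle):

    # Scan every cell in all 8 directions for the word.
    DIRS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

    def find_word(grid: list[str], word: str):
        rows, cols = len(grid), len(grid[0])
        for r in range(rows):
            for c in range(cols):
                for dr, dc in DIRS:
                    cells = [(r + i * dr, c + i * dc) for i in range(len(word))]
                    if all(0 <= rr < rows and 0 <= cc < cols and grid[rr][cc] == ch
                           for (rr, cc), ch in zip(cells, word)):
                        return cells  # list of (row, col) positions
        return None

    grid = ["CATS",
            "AXEX",
            "RXXX",
            "TREE"]
    print(find_word(grid, "CART"))  # vertical: [(0, 0), (1, 0), (2, 0), (3, 0)]
    print(find_word(grid, "TREE"))  # horizontal along the bottom row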

kadushka

These models are not trained on character level input. Why would anyone expect them to perform well on character level puzzles?

Jensson

They are trained on many billions of tokens of text dealing with character-level input; they would be rather dumb if they couldn't learn it anyway.

Every human learns this: when you hear the sound "strawberry" you don't hear the double r there, yet you still know the answer.

brookst

These models operate on tokens, not characters. It’s true that training budgets could be spent on exhaustively enumerating how many of each letter are in every word in every language, but it’s just not useful enough to be worth it.

It’s more like asking a human for the Fourier components of how they pronounce “strawberry”. I mean the audio waves are right there, why don’t you know?
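
You can see the token/character mismatch directly with a tokenizer (this uses the real tiktoken library; the exact split depends on the encoding):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("strawberry")
    print(ids)                             # a few integer token ids, not 10 letters
    print([enc.decode([t]) for t in ids])  # multi-character pieces like 'str'/'aw'/'berry'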

brrrrrm

Emergent behavior. These things are surprisingly good at generalizing.

jppope

Interested to see how it all works out. Elon has been using a lot of smoke and mirrors lately, but this seems like an area where they can genuinely make progress - with the right talent, competing in the GenAI world is totally possible right now. Sign me up for improvements in this space!