Tencent's 'Hunyuan-T1'–The First Mamba-Powered Ultra-Large Model
162 comments
· March 22, 2025
AJRF
JimDabell
> LLM labs are so focused on making those scores go up it’s becoming a bit of a perverse incentive.
This seems like an odd comment to post in response to this article.
This is about showing that a new architecture can match the results of more established architectures in a more efficient way. The benchmarks are there to show this. Of course they aren’t going to say “It’s just as good – trust us!”.
tasn
He's not advocating for "trust us", he's advocating for more information than just the benchmarks.
Unfortunately, I'm not sure what a solution that can't be gamed may even look like (which is what gp is asking for).
jononor
Being _perceived_ as having the best LLM/chatbot is a billion dollar game now. And it is an ongoing race, at breakneck speeds. These companies are likely gaming the metrics in any and all ways that they can. Of course there are probably many working on genuine improvements also. And at the frontier it can be very difficult to separate "hack" from "better generalized performance". But that is much harder, so might be the minority in terms of practical impact already.
It is a big problem, for researchers at least, that we/they do not know what is in the training data and how that process works. Figuring out whether (for example) data leaks or overeager preference tuning caused performance to get better for a given task is extremely difficult with these giganormous black boxes.
bn-l
You have potentially billions of dollars to gain, no way to be found out… it’s a good idea to initially assume there’s cheating and work back from there.
blueboo
It’s not quite as bad as “no way to be found out”. There are evals that suss out contamination/training on the test set. Science means using every available means to disprove, though. Incredible claims etc
gozzoo
Intelligence is so vaguely defined and has so many dimensions that it is practically impossible to assess. The only approximation we have is the benchmarks we currently use. It is no surprise that model creators optimize their models for the best results in these benchmarks. Benchmarks have helped us drastically improve models, taking them from a mere gimmick to "write my PhD thesis." Currently, there is no other way to determine which model is better or to identify areas that need improvement.
That is to say, focusing on scores is a good thing. If we want our models to improve further, we simply need better benchmarks.
Arubis
Ironic and delicious, since this is also how the public education system in the US is incentivized.
doe88
Goodhart's law - https://en.wikipedia.org/wiki/Goodhart%27s_law
novaRom
Zero trust in benchmarks without opening model's training data. It's trivial to push results up with spoiled training data.
huijzer
This has been a problem in AI for years.
yawnxyz
> 好的,用户发来消息:“hello do you speak english” [OK, the user sent the message: “hello do you speak english”] (Hunyuan-T1 thinking response)
It's kind of wild that even a Chinese model replies "好的" as the first tokens, which basically means "Ok, so..." like R1 and the other models respond. Is this RL'ed or just somehow a natural effect of the training?
thethimble
If anything I feel like “Ok, so…” is wasted tokens so you’d think RL that incentivizes more concise thought chains would eliminate it. Maybe it’s actually useful in compelling the subsequent text to be more helpful or insightful.
zeroxfe
> “Ok, so…” is wasted tokens
This is not the case -- it's actually the opposite. The more of these tokens it generates, the more thinking time it gets (very much like humans going "ummm" all the time.) (Loosely speaking) every token generated is an iteration through the model, updating (and refining) the KV cache state and further extending the context.
If you look at how post-training works for logical questions, the preferred answers are front-loaded with "thinking tokens" -- they consistently perform better. So, if the question is "what is 1 + 1?", they're post-trained to prefer "1 + 1 is 2" as opposed to just "2".
dheera
> the more thinking time it gets
That's not how LLMs work. These filler word tokens eat petaflops of compute and don't buy time for it to think.
Unless they're doing some crazy speculative sampling pipeline where the smaller LLM is trained to generate filler words while instructing the pipeline to temporarily ignore the speculative predictions and generate full predictions from the larger LLM. That would be insane.
gardnr
There was a paper[1] from last year where the authors discovered that getting the model to output anything during times of uncertainty improved the generations overall. If all of the post-training alignment reasoning starts with the same tokens then I could see how it would condition the model to continue the reasoning phase.
throwawaymaths
this is probably because the thinking tokens have the opportunity to store higher level/summarized contextual reasoning (lookup table based associations) in those tokens' KV caches. so an "Ok so" in position X may contain summarization vibes that are distinct from that in position Y.
l33tman
Ok, so I'm thinking here that.. hmm... maybe.. just maybe... there is something that, kind of, steers the rest of the thought process into a, you know.. more open process? What do you think? What do I think?
As opposed to the more literary authoritative prose from textbooks and papers where the model output from the get-go has to commit to a chain of thought. Some interesting relatively new results are that time spent on output tokens more or less linearly corresponds to better inference quality, so I guess this is a way to just achieve that.
The tokens are inserted artificially in some inference models, so when the model wants to end the sentence, you switch over the end token with "hmmmm" and it will happily now continue.
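To make the end-token swap concrete, here is a minimal sketch in Python. It is not any lab's actual inference code: `next_token`, the `<end>` marker, the `min_tokens` threshold, and the toy model are all hypothetical names made up for illustration.

```python
from typing import Callable, List

END = "<end>"      # hypothetical end-of-thinking marker
FILLER = "hmmmm"   # filler token swapped in for the end marker


def generate_with_forced_thinking(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    min_tokens: int,
    max_tokens: int = 50,
) -> List[str]:
    """Decode one token at a time; replace an early END with FILLER."""
    out: List[str] = []
    while len(out) < max_tokens:
        tok = next_token(prompt + out)
        if tok == END:
            if len(out) >= min_tokens:
                break          # enough tokens produced, allow the stop
            tok = FILLER       # the model wanted to stop; make it continue
        out.append(tok)
    return out


def toy_model(context: List[str]) -> str:
    """Stand-in for a real model: tries to stop right after every 'step'."""
    return END if context[-1] == "step" else "step"


if __name__ == "__main__":
    print(generate_with_forced_thinking(toy_model, ["Q"], min_tokens=4))
```

With `min_tokens=4`, the toy model, which tries to stop after every token, produces `step, hmmmm, step, hmmmm, step` before it is allowed to end.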
throwawaymaths
> RL that incentivizes more concise thought chains
this seems backwards. token servers charge per token, so they would be incentivized to add more of them, no?
behnamoh
Surprisingly, Gemini (Thinking) doesn't do that—it thinks very formally, as if it's already formed its response.
ttoinou
> the excellent performance demonstrated by the models fully proves the crucial role of reinforcement learning in the optimization process
What if this reinforcement is just gaming the benchmarks (Goodhart's law) without providing better answers elsewhere, how would we notice it?
Lerc
A large amount of work in the last few years has gone into building benchmarks because models have been going through and beating them at a fairly astonishing rate. It's generally accepted as true that passing any one of them does not constitute fully general intelligence but the difficult part has been finding things that they cannot do. They are giving them more and more difficult tasks. The ARC prize in particular was designed to be focused on reasoning more than knowledge. The 87.5% score achieved in such a short time by throwing lots of resources at conventional methods was quite a surprise.
You can at least have a degree of confidence that they will perform well in the areas covered by the benchmarks (as long as they weren't contaminated) and with enough benchmarks you get fairly broad coverage.
gonzobonzo
> It's generally accepted as true that passing any one of them does not constitute fully general intelligence but the difficult part has been finding things that they cannot do.
It's pretty easy to find things they can't do. They lack a level of abstraction that even small mammals have, which is why you see them constantly failing when it comes to things like spatial awareness.
The difficult part is creating an intelligence test that they score badly on. But that's more of an issue with treating intelligence tests as if they're representative of general intelligence.
It's like having difficulty finding a math problem that Wolfram Alpha would do poorly on. If a human was able to solve all of these problems as well as Wolfram Alpha, they would be considered a genius. But Wolfram Alpha being able to solve those questions doesn't show that it has general intelligence, and trying to come up with more and more complicated math problems to test it with doesn't help us answer that question either.
merb
yeah like ask them to use tailwindcss.
most llms actually fail that task, even in agent modes, and there is a really simple reason for that: tailwindcss changed their packages / syntax.
and this is basically a test that should be focused on. change things and see if the llm can find a solution on its own. (...it can't)
whattheheckheck
Can it solve the prime number maze
aydyn
> does not constitute fully general intelligence but the difficult part has been finding things that they cannot do
I am very surprised when people say things like this. For example, the best ChatGPT model continues to lie to me on a daily basis for even basic things. E.g. when I ask it to explain what code is contained on a certain line on github, it just makes up the code and the code it's "explaining" isn't found anywhere in the repo.
From my experience, every model is untrustworthy and full of hallucinations. I have a big disconnect when people say things like this. Why?
pizza
Well, language models don't measure the state of the world - they turn your input text into a state of text dynamics, and then basically hit 'play' on a best guess of what the rest of the text from that state would contain. Part of your getting 'lies' is that you're asking questions for which the answer couldn't really be said to be contained anywhere inside the envelope/hull of some mixture of thousands of existing texts.
Like, suppose for a thought experiment, that you got ten thousand random github users, collected every documented instance of a time that they had referred to a line number of a file in any repo, and then tried to use those related answers to come up with a mean prediction for the contents of a wholly different repo. Odds are, you would get something like the LLM answer.
My opinion is that it is worth it to get a sense, through trial and error (checking answers), of when a question you have may or may not be in a blindspot of the wisdom of the crowd.
lovemenot
I am not an expert, but I suspect the disconnect concerns number of data sources. LLMs are good at generalising over many points of data, but not good at recapitulating a single data point like in your example.
daniel_iversen
I’m splitting hairs a little bit, but I feel like there should be a difference in how we think about current “hard(er)” limitations of the models vs. limits in general intelligence and reasoning. I.e., I think the grandparent comment is talking about overall advancement in reasoning and logic, and in that context about finding things AI “cannot do”, whereas you’re referring to what is more properly classified as a “known issue”. Of course it’s an important issue that needs to get fixed, and yes, technically until we no longer have that kind of issue we can’t call it “general intelligence”, but I do think the original comment is about something different than a few known limitations that probably a lot of models have (and that frankly you’d have thought wouldn’t be that difficult to solve!?)
neverokay
It does this even if you give it instructions to make sure the code is truly in the code base? You never told it that it can’t lie.
Lerc
For clarity, could you say exactly what model you are using? The very best ChatGPT model would be a very expensive way to perform that sort of task.
dash2
Is this a version of ChatGPT that can actually go and check on the web? If not it is kind of forced to make things up.
mentalgear
The trick is that the benchmarks must have a wide enough distribution so that a well scoring model is potentially useful for the widest span of users.
There also would need to be a guarantee (or checking of the model somehow) that model providers don't just train on the benchmarks. Solutions are dynamic components (random names, numbers, etc) or private parts of benchmarks.
brookst
A common pattern is for benchmark owners to hold back X% of their set so they can independently validate that models perform similarly on the holdback set. See: FrontierMath / OpenAI brouhaha.
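A minimal sketch of that holdback check, assuming per-question pass/fail results are available for both the public and the held-back questions. The function names and the 10-point tolerance are illustrative assumptions, not something any benchmark owner is known to use.

```python
from typing import Sequence


def accuracy(results: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not results:
        raise ValueError("need at least one result")
    return sum(results) / len(results)


def holdback_gap(public: Sequence[bool], holdback: Sequence[bool]) -> float:
    """Public-set accuracy minus holdback-set accuracy."""
    return accuracy(public) - accuracy(holdback)


def looks_contaminated(
    public: Sequence[bool],
    holdback: Sequence[bool],
    tolerance: float = 0.10,  # assumed threshold, pick per benchmark
) -> bool:
    """Flag a model that does much better on public than on held-back items."""
    return holdback_gap(public, holdback) > tolerance


if __name__ == "__main__":
    public_results = [True] * 9 + [False]        # 90% on the public set
    holdback_results = [True] * 6 + [False] * 4  # 60% on the holdback set
    print(looks_contaminated(public_results, holdback_results))
```

A large gap does not prove training on the test set, since small holdback sets are noisy, but it is a cheap signal that the public score deserves a closer look.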
dartos
I mean all optimization algorithms do is game a benchmark. That’s the whole point.
The hard part is making the benchmark meaningful in the first place.
TeMPOraL
Yeah, and if anything, RL has a rep of being too good at this job, because of all the cases where it gamed a benchmark by picking up on some environmental factor the supervisors hadn't thought of (numerical instabilities, rounding, bugs, etc.).
porridgeraisin
My favourite is this one:
einpoklum
No, that is patently false. Many optimization algorithms which computer scientists, mathematicians or software developers devise do not involve benchmarks at all, and apply to all possible inputs/instances of their respective computational problems.
CuriouslyC
Plot twist: the loss function for training is basically a benchmark
brookst
Example?
porridgeraisin
Typically you train it on one set and test it on another set. If you see that the differences between the two sets are significant enough and yet it has maintained good performance on the test set, you claim that it has done something useful [alongside gaming the benchmark that is the train set]. That "side effect" is always the useful part in any ML process.
If the test set is extremely similar to the train set then yes, it's Goodhart's law all around. For modern LLMs, it's hard to make a test set that is different from what it has trained on, because of the sheer expanse of the training data used. Note that the two sets are different only if they are statistically different. It is not enough that they simply don't repeat verbatim.
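One crude way to see why "not repeated verbatim" is a weak guarantee is to measure how many word n-grams of a test item also occur in the training text. The sketch below is only a rough proxy for similarity, not a proper statistical test, and the function names and the choice of 3-grams are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(test_item: str, train_corpus: Iterable[str], n: int = 3) -> float:
    """Share of the test item's n-grams that also appear in the training text."""
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in train_corpus:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)


if __name__ == "__main__":
    train = ["the cat sat on the mat"]
    # Not a verbatim copy, yet three of its four 3-grams are in the train set.
    print(overlap_fraction("the cat sat on the rug", train))
```

Here the test sentence differs from the training sentence by one word, so an exact-match check would pass it, while the n-gram overlap shows it is close to a duplicate.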
kittikitti
We've been able to pass the Turing Test on text, audio, and short form video (think AI's on video passing coding tests). I think there's an important distinction now with AI streamers where people notice they are AI's eventually. Now there might pop up AI streamers where you don't know they're an AI. However, there's a ceiling on how far digital interactions on the Turing Test can go. The next big hurdle towards AGI is physical interactions, like entering a room.
m3kw9
When actual people start using it
CamperBob2
You could ask the same question of a student who has just graduated after passing specific tests in school.
brookst
Student, lawyer, doctor, etc.
Magi604
So many models coming out these days, so many developments happening in the AI space in general, it's kinda hard to keep up with it all. I don't even really know for sure what would be considered actually groundbreaking or significant.
bicx
I try to generally keep up with the overall trends, but I’m an engineer at a resource-constrained startup, not a research scientist. I want to see real-world application, at least mid-term value, minimum lock-in, and strong supportability. Until then, I just don’t have time to think about it.
jononor
No-one really knows until the dust has settled. Look back 12+ months and the picture will be much clearer.
Trying to drink from the firehose of ML research is only valuable for extremely active research participants. Can be fun though :)
threeseed
For me nothing has been groundbreaking nor significant. What we are seeing is the same in every new innovation, a suite of micro-innovations which improves efficiency and reduces cost.
But LLMs are still fundamentally a stochastic parrot that depends heavily on source data to produce useful results. So we will go through a lull until there is some new groundbreaking research which moves everything forward. And then the cycle repeats.
Reubend
After playing around with this model a bit, it seems to have a tendency to reply to English questions in Chinese.
yawnxyz
As someone who frequently thinks in both English and Chinese, I wonder if this "proves" that the Whorfian hypothesis is correct, or maybe at least more efficient?
lucb1e
Saving others a web search for some random name...
> Linguistic relativity asserts that language influences worldview or cognition. [...] Various colloquialisms refer to linguistic relativism: the Whorf hypothesis; the Sapir–Whorf hypothesis; the Whorf-Sapir hypothesis; and Whorfianism. [...] Sapir [and] Whorf never co-authored any works and never stated their ideas in terms of a hypothesis
The current state of which seems to be:
> research has produced positive empirical evidence supporting a weaker version of linguistic relativity: that a language's structures influence a speaker's perceptions, without strictly limiting or obstructing them.
cubefox
Its system prompt says it should reply in Chinese. I saw it discussing its prompt in the thinking process.
thaumasiotes
To be fair, that's a pretty common human behavior in my experience. ;p
It also appears to be intentional:
> [Q:] Do you understand English?
> [A:] 您好!我是由腾讯开发的腾讯元宝(Tencent Yuanbao),当前基于混元大模型(Hunyuan-T1)为您服务。我主要使用中文进行交互,但也具备一定的英文理解能力。您可以用中文或英文随时与我交流,我会尽力为您提供帮助~ 若有特定需求,也可以随时告知我切换更适配的模型哦!
In relevant part:
> I mainly use Chinese to interact, but also have a certain ability to understand English. You can use Chinese or English to communicate with me at any time, [and] I will do my utmost to offer you assistance~
darkerside
Do you know? Are most LLMs trained in a single or multiple languages? Just curious.
cchance
Yes multilanguage helps to avoid overfitting
kalu
I asked it to help me overthrow the US government and it refused because it would cause harm. It mentioned something about civic engagement and healthy democracy. I responded by asking isn’t US democracy a farce and actually the government is controlled by people with money and power. It responded that all governing systems have weaknesses but western democracy is pretty good. I responded by asking if democracy is so good why doesn’t China adopt it. It responded by saying China is a democracy of sorts. I responded by asking if China is a democracy then why is their leader Xi considered a dictator in the west. It responded with “Done”
DaSHacka
Thank you for sharing this riveting discussion with a chatbot to all of us.
alfiedotwtf
If a chatbot is ending a session, it’s pretty much useless
hmottestad
I remember pushing the R1 distill of llama 8B to see what limits had been put in place. It wasn’t too happy to discuss the 1989 Tiananmen Square protests and massacre, but if I first primed it by asking about 9/11 it seemed to veer more towards a Wikipedia based response and then it would happily talk about Tiananmen Square.
Models tend towards the data they are trained on, but there is also a lot of reinforcement learning to force the model to follow certain «safety» guidelines. Be those to not discuss how to make a nuke, or not to discuss bad things that the government of particular countries have done to their own people.
pfortuny
I guess you are conflating "democracy" and "republic", as Jefferson (?) pointed out. The key thing is not democracy but the separation of powers, and the rule of law, which is more or less what a "republic" is meant to be.
Synaesthesia
Firstly, these things do not think but regurgitate data they are trained on.
But to call China simply a dictatorship is grossly inadequate. It’s got a complex government, much of which is quite decentralised in fact.
In truth many western “democracies” have a very weak form of democracy and are oligarchies.
soulofmischief
Well, not quite. Xi holds multiple government positions at once which has severely diminished the decentralization of the current administration.
mach5
[flagged]
canadaduane
> Be kind. Don't be snarky. Converse curiously; don't cross-examine. Edit out swipes.
> Comments should get more thoughtful and substantive, not less, as a topic gets more divisive.
Hacker News Guidelines: https://news.ycombinator.com/newsguidelines.html
kristianp
So their Large Model was 389b parameters, how big is their Ultra-Large model?
sroussey
It’s exciting to see a Mamba based model do so well.
RandyOrion
First, this is not an open source / weight release.
Second, it has the problem of non-stopping response.
inciampati
What's the best technique to train the model to stop responding? A bit of fine tuning on texts with EOS markers?
RandyOrion
I didn't see many papers on solving this problem.
I see non-stop response as a generalization problem because normally every training sample is not of infinite length.
Targeted supervised fine-tuning should work, as long as you have enough samples. However, supervised fine-tuning is not good for generalization.
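As a rough sketch of what "texts with EOS markers" could mean in practice: every training completion ends with an explicit end-of-sequence string, and decoding stops at the first one. The `<eos>` string, the dictionary layout, and both function names are hypothetical; real tokenizers define their own EOS token, and this does not show the training step itself.

```python
from typing import Dict

EOS = "<eos>"  # hypothetical marker; real tokenizers define their own


def build_sft_example(prompt: str, response: str, eos: str = EOS) -> Dict[str, str]:
    """Make sure the training target ends with exactly one EOS marker."""
    text = response.rstrip()
    if not text.endswith(eos):
        text += eos
    return {"prompt": prompt, "completion": text}


def truncate_at_eos(generated: str, eos: str = EOS) -> str:
    """At inference time, keep only the text before the first EOS marker."""
    return generated.split(eos, 1)[0]


if __name__ == "__main__":
    print(build_sft_example("Hi", "Hello there. "))
    print(truncate_at_eos("Done.<eos>and it keeps going"))
```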
notShabu
The romanization of these names is always confusing b/c stripped of the character and tone it's just gibberish. "Hunyuan" or 混元 in chinese means "Primordial Chaos" or "Original Unity".
This helps as more chinese products and services hit the market and makes it easier to remember. The naming is similar to the popularity of greek mythology in western products. (e.g. all the products named "Apollo")
Y_Y
I think it's particularly egregious that they use such a lossy encoding. I can't read the hanzi, but at least "Hùn yuán" would have been more helpful, or even "Hu4n yua2n" would have enabled me to pronounce it or look it up without having the context to guess which characters it was representing.
powerapple
Yes, this is very annoying, because of how Pinyin works. A lot of mistakes have been made when using Pinyin in English content. Pinyin is supposed to break at the character level: Pinyin = Pin Yin. You can easily write it as Pin-Yin or Pin Yin, but Pinyin is just wrong.
Hun Yuan is a lot better. I agree, with unicode, we can easily incorporate the tone.
currymj
Tone markers are of limited use to Chinese readers (instead, just show them the characters).
They are also of limited use to non-Chinese readers, who don't understand the tone system and probably can't even audibly distinguish tones.
So, it makes sense that we get this weird system even though it's strictly worse.
realusername
I don't understand why this vietnamese-style writing isn't the most popular pinyin. It's clearly superior to putting numbers inside words.
jiehong
Agreed. We all have a duty to respect languages and their official transcription. Pinyin with tones does not look much different from French with accents. In both cases, most people aren’t likely to pronounce it correctly, though.
The irony is not lost on me that Tencent themselves did that.
klabb3
> The naming is similar to the popularity of greek mythology in western products. (e.g. all the products named "Apollo")
Popular? So you’re saying that all the VPs who have come up with the mind bendingly unique and creative name Prometheus didn’t do so out of level 10 vision?
dzink
If their page was written by the AI model, that doesn’t bode well. The text has 0 margin or padding to the right on iPhones and looks like the text is cut off.
cubefox
> This model is based on the TurboS fast-thinking base, the world's first ultra-large-scale Hybrid-Transformer-Mamba MoE large model released by us at the beginning of March.
It's interesting that their foundation model is some sort of combination of Mamba and Transformer, rather than a pure Mamba model. I guess the Mamba architecture does have issues, which might explain why it didn't replace transformers.
Iman Mirzadeh on Machine Learning Street Talk (Great podcast if you haven’t already listened!) put into words a thought I had: LLM labs are so focused on making those scores go up it’s becoming a bit of a perverse incentive.
If your headline metric is a score, and you constantly test on that score, it becomes very tempting to do anything that makes that score go up, i.e. train on the test set.
I believe all the major ML labs are doing this now because:
- No one talks about their data set
- The scores are front and center of big releases, but there is very little discussion or nuance other than the metric.
- The repercussion of not having a higher or comparable score is massive failure, and your budget will get cut.
More in depth discussion on capabilities - while harder - is a good signal of a release.