Jagged AGI: o3, Gemini 2.5, and everything after
155 comments
· April 20, 2025 · mellosouls
Zambyte
I think a reasonable definition of intelligence is the application of reason to knowledge. An example of a system that is highly knowledgeable but has little to no reason would be an encyclopedia. An example of a system that is highly reasonable, but has little knowledge, would be a calculator. Intelligent systems demonstrate both.
Systems that have general intelligence are ones that are capable of applying reason to an unbounded domain of knowledge. Examples of such systems include: libraries, wikis, and forums like HN. These systems are not AGI, because the reasoning agents in each of these systems are organic (humans); they are more like a cyborg general intelligence.
Artificial general intelligences are just systems that are fully artificial (i.e. computer programs) that can apply reason to an unbounded domain of knowledge. We're here, and we have been for years. AGI sets no minimum as to how great the reasoning must be, but it's obvious to anyone who has used modern generative intelligence systems like LLMs that the technology can be used to reason about an unbounded domain of knowledge.
If you don't want to take my word for it, maybe Peter Norvig can be more convincing: https://www.noemamag.com/artificial-general-intelligence-is-...
jimbokun
Excellent article and analysis. Surprised I missed it.
It is very hard to argue with Norvig’s arguments that AGI has been around since at least 2023.
conception
I think the thing missing would be memory. The knowledge of current models is more or less static, save for whatever you can cram into their context window. If they had memory and thus the ability to learn ("oh hey, I've already tried to solve this bug in these ways, maybe I won't get stuck in a loop on them!"), that would be the AGI push for me. Real-time incorporation of new knowledge into the model is the missing piece.
daxfohl
Until you can boot one up, give it access to a VM's video and audio feeds and keyboard and mouse interfaces, give it an email and chat account, tell it where the company onboarding docs are, and expect it to be a productive team member, it's not AGI. So long as we need special protocols like MCP and A2A, rather than expecting it to figure out how to collaborate like a human, it's not AGI.
The first step, my guess, is going to be the ability to work through github issues like a human, identifying which issues have high value, asking clarifying questions, proposing reasonable alternatives, knowing when to open a PR, responding to code review, merging or abandoning when appropriate. But we're not even very close to that yet. There's some of it, but from what I've seen most instances where this has been successful are low level things like removing old feature flags.
rafaelmn
Just because we rely on vision to interface with computer software doesn't mean it's optimal for AI models. Having a specialized interface protocol is orthogonal to capability. Just like you could theoretically write code in a proportional font with Notepad and run your tools through Windows CMD: having an editor with syntax highlighting and a monospaced font helps you read/navigate/edit, and having tools/navigation/autocomplete etc. optimized for your flow makes you more productive and expands your capability.
If I forced you to use unnatural interfaces it would severely limit your capabilities as well, because you'd have to dedicate more effort towards handling basic editing tasks. As someone who recently swapped to a split 36-key keyboard with a new layout, I can say this becomes immediately obvious when you try something like this. You take your typing/editing skills for granted - try switching your setup and see how your productivity/problem-solving ability tanks in practice.
j_timberlake
The exact definition of AGI is pretty much the least interesting thing about AGI. It's basically bike-shedding at this point: arguing about something easy to understand instead of tackling the really hard questions like "how competent can AI get before it's too dangerous to be in the hands of flakey tech companies?"
yeahwhatever10
This is the forum that fell the hardest for the superconductor hoax a few years ago. HN has no superiority leg to stand on.
ashoeafoot
AGI is an anonymous good model coming around the corner with no company and no LLM researchers attached. AGI is when the LLM hype train threads are replaced with CEOs and laid-off researchers demanding UBI.
ben_w
It's easy to treat AGI as one thing — I did so myself before everyone's differing reaction to LLMs made me realise we all mean different things by each of the three letters of the initialism, and that none of those initials are really boolean valued.
Given how Dutch disease[0] is described, I suspect that if the "G" (general) increases with fixed "I" (intelligence), then as the proportion of economic activity for which the Pareto frontier is AI rather than human expands, humans will get pay rises for the remaining work right up until they become unemployable.
On the other hand, if "G" is fully general and it's "I" which rises for a suitable cost[1], it goes through IQ 55 (displacing no workers) to IQ 100 (probably close to half of workers redundant, but mean of population doesn't have to equal mean of workforce), to IQ 145 (almost everyone redundant), to IQ 200 (definitionally renders everyone redundant).
[0] https://en.wikipedia.org/wiki/Dutch_disease
[1] A fully-general AGI with the equivalent of IQ 200 on any possible test, still can't replace a single human if it costs 200 trillion USD per year to run.
MichaelZuo
Yeah, formal agreement seems exceedingly unlikely, since there isn't even agreement on the definition of "Artificial Intelligence".
Rebuff5007
> clearly in many cases superhuman
In what cases is it superhuman exactly? And what humans are you comparing against?
I'd bet that for any discipline you chose, one could find an expert in that field that can trick any of today's post-gpt3 ais.
jjeaff
I suspect AGI will be one of those things that you can't describe it exactly, but you'll know it when you see it.
NitpickLawyer
> but you'll know it when you see it.
I agree, but with the caveat that it's getting harder and harder with all the hype / doom cycles and all the goalpost moving that's happening in this space.
IMO if you took gemini2.5 / claude / o3 and showed it to people from ten / twenty years ago, they'd say that it is unmistakably AGI.
bayarearefugee
There's no way to be sure in either case, but I suspect their impressions of the technology ten or twenty years ago would be not so different from my experience of first using LLMs a few years ago...
Which is to say complete amazement followed quickly by seeing all the many ways in which it absolutely falls flat on its face revealing the lack of actual thinking, which is a situation that hasn't fundamentally changed since then.
Jensson
> IMO if you took gemini2.5 / claude / o3 and showed it to people from ten / twenty years ago, they'd say that it is unmistakably AGI.
No they wouldn't, since those still can't replace human white collar workers even at many very basic tasks.
Once AGI is here most white collar jobs are gone, you'd only need to hire geniuses at most.
sebastiennight
I don't think so, and here's my simple proof:
You and I could sit behind a keyboard, role-playing as the AI in a reverse Turing test, typing away furiously at the top of our game, and if you told someone that their job is to assess our performance (thinking they're interacting with a computer), they would still conclude that we are definitely not AGI.
This is a battle that can't be won at any point because it's a matter of faith for the forever-skeptic, not facts.
mac-mc
When it can replace a polite, diligent, experienced 120 IQ human in all tasks. So it has a consistent long-term narrative memory, doesn't "lose the plot" as you interact with it longer and longer, can pilot robots to do physical labor without much instruction (the current state of the art is not that; a trained human will still do much better), can drive cars, generate images without goofy non-human style errors, etc.
torginus
I still can't have an earnest conversation or bounce ideas off of any LLM - all of them seem to be a cross between a sentient encyclopedia and a constraint solver.
They might get more powerful but I feel like they're still missing something.
itchyjunk
Why are you not able to have an earnest conversation with an LLM? What kind of ideas are you not able to bounce off LLMs? These seem to be the type of use cases where LLMs have generally shined for me.
HDThoreaun
I felt this way until I tried Gemini 2.5. IMO it fully passes the Turing test unless you're specifically utilizing tricks that LLMs are known to fall for.
dgs_sgd
This is actually how a supreme court justice defined the test for obscenity.
> The phrase "I know it when I see it" was used in 1964 by United States Supreme Court Justice Potter Stewart to describe his threshold test for obscenity in Jacobellis v. Ohio
sweetjuly
The reason why it's so famous though (and why some people tend to use it in a tongue in cheek manner) is because "you know it when you see it" is a hilariously unhelpful and capricious threshold, especially when coming from the Supreme Court. For rights which are so vital to the fabric of the country, hinging free speech on—essentially—the Supreme Court recommending we make decisions based on unquantifiable vibes is equal parts bizarre and uncharacteristic.
jimbokun
We have all seen it and are now just in severe denial.
afro88
This is part of what the article is about
ninetyninenine
I suspect everyone will call it a stochastic parrot because it got this one thing wrong. And this will continue into the far, far future; even when it becomes sentient we will completely miss it.
AstralStorm
It's more than that but less than intelligence.
Its generalization capabilities are a bit on the low side, and memory is relatively bad. But it is much more than just a parrot now; it can handle some basic logic, but it can't follow given patterns correctly for novel problems.
I'd liken it to something like a bird, extremely good at specialized tasks but failing a lot of common ones unless repeatedly shown the solution. It's not a corvid or a parrot yet. Fails rather badly at detour tests.
It might be sentient already, though. Someone needs to run a test of whether it can discern itself from another instance of itself in its own work.
Jensson
Once it has pushed most humans out of white collar labor, so the remaining humans work in blue collar jobs, people won't say it's just a stochastic parrot.
chrsw
What about all the things that aren't strictly intelligence but are, I guess, intelligence-adjacent: autonomy, long-term memory, motivation, curiosity, resilience, goals, choice, and maybe the biggest of them all: fear? Why would an AGI "want" to do anything more than my calculator "wants" to compute an answer to some math problem I gave it? Without these things an AGI, or whatever, is just an extension of whoever is ultimately controlling it.
And that's when we return to a much older and much more important question than whether Super LLM 10.0 Ultra Plus is AGI or not: how much power should a person or group of people be allowed to have?
hiAndrewQuinn
https://gwern.net/tool-ai is a quite comprehensive dive into why.
sejje
In the last example (the riddle), I generally assume the AI isn't misreading; rather, it assumes you didn't give it the riddle correctly, because it has seen it already.
I would do the same thing, I think. It's too well-known.
The variation doesn't read like a riddle at all, so it's confusing even to me as a human. I can't find the riddle part. Maybe the AI is confused, too. I think it makes an okay assumption.
I guess it would be nice if the AI asked a follow up question like "are you sure you wrote down the riddle correctly?", and I think it could if instructed to, but right now they don't generally do that on their own.
Jensson
> generally assume the AI isn't misreading, rather that it assumes you couldn't give it the riddle correctly, but it has seen it already.
An LLM doesn't assume; it's a text completer. It sees something that looks almost like a well-known problem and will complete it as that well-known problem. It's a problem specific to being a text completer that is hard to get around.
wongarsu
If you have the model output a chain of thought, whether it's a reasoning model or you prompt a "normal" model to do so, you will see examples of the model going "user said X, but did they mean Y? Y makes more sense, I will assume Y". Sometimes stretched over multiple paragraphs, consuming the entire reasoning budget for that prompt.
Discussing whether models can "reason" or "think" is a popular debate topic on here, but I think we can all at least agree that they do something that at least resembles "reasoning" and "assumptions" from our human point of view. And if in its chain-of-thought it decides your prompt is wrong, it will go ahead and answer what it assumes is the right prompt.
sejje
> it's a text completer
Yes, and it can express its assumptions in text.
Ask it to make some assumptions, like about a stack for a programming task, and it will.
Whether or not the mechanism behind it feels like real thinking to you, it can definitely do this.
wobfan
If you call putting together text that reads like an assumption "expressing an assumption", then yes. But it cannot express an assumption, as it is not assuming. It is completing text, like OP said.
simonw
These newer "reasoning" LLMs really don't feel like pure text completers any more.
Borealid
What your parent poster said is nonetheless true, regardless of how it feels to you. Getting text from an LLM is a process of iteratively attempting to find a likely next token given the preceding ones.
If you give an LLM "The rain in Spain falls" the single most likely next token is "mainly", and you'll see that one proportionately more than any other.
If you give an LLM "Find an unorthodox completion for the sentence 'The rain in Spain falls'", the most likely next token is something other than "mainly" because the tokens in "unorthodox" are more likely to appear before text that otherwise bucks statistical trends.
If you give the LLM "blarghl unorthodox babble The rain in Spain" it's likely the results are similar to the second one but less likely to be coherent (because text obeying grammatical rules is more likely to follow other text also obeying those same rules).
In any of the three cases, the LLM is predicting text, not "parsing" or "understanding" a prompt. The fact it will respond similarly to a well-formed and unreasonably-formed prompt is evidence of this.
It's theoretically possible to engineer a string of complete gibberish tokens that will prompt the LLM to recite song lyrics, or answer questions about mathematical formulae. Those strings of gibberish are just difficult to discover.
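To make the mechanism concrete, here is a minimal sketch of that "find a likely next token" step. It assumes the Hugging Face transformers library and uses GPT-2 purely as a small, public stand-in model (neither is mentioned above); the point is only that the model emits one probability distribution per step, from which a completion is picked or sampled.

```python
# Minimal sketch (assumption: Hugging Face transformers + GPT-2 as a stand-in model).
# It prints the model's top candidates for the single next token after a prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The rain in Spain falls"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits        # shape: (1, seq_len, vocab_size)

next_token_logits = logits[0, -1]           # distribution over the next token only
probs = torch.softmax(next_token_logits, dim=-1)
top = torch.topk(probs, k=5)

for p, idx in zip(top.values, top.indices):
    print(f"{p.item():.3f}  {tokenizer.decode([int(idx)])!r}")
```

Whether " mainly" actually tops that list depends on the model and tokenizer; the sketch only illustrates that, well-formed prompt or gibberish, the model does the same thing each step: produce a distribution over next tokens.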
gavinray
Is it not physically impossible for LLMs to be anything but "plausible text completion"?
Neural Networks as I understand them are universal function approximators.
In terms of text, that means they're trained to output what they believe to be the "most probably correct" sequence of text.
An LLM has no idea that it is "conversing", or "answering" -- it relates some series of symbolic inputs to another series of probabilistic symbolic outputs, aye?
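As a toy illustration of that training objective (not anything from the thread), the standard next-token loss is just cross-entropy between the model's predicted distribution at each position and the token that actually follows; the tensors below are random stand-ins for real model outputs and data.

```python
# Toy illustration of the next-token objective; logits and tokens are random stand-ins.
import torch
import torch.nn.functional as F

vocab_size = 50_000
batch, seq_len = 2, 16

logits = torch.randn(batch, seq_len, vocab_size)         # pretend model output
tokens = torch.randint(0, vocab_size, (batch, seq_len))  # pretend training text

# Each position is scored on predicting the token that follows it.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..n-2
    tokens[:, 1:].reshape(-1),               # targets are positions 1..n-1
)
print(loss.item())
```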
jordemort
And yet
og_kalu
Text Completion is just the objective function. It's not descriptive and says nothing about how the models complete text. Why people hang on this word, I'll never understand. When you wrote your comment, you were completing text.
The problem you've just described is a problem with humans as well. LLMs are assuming all the time. Maybe you would like to call it another word, but it is happening.
Jensson
> When you wrote your comment, you were completing text.
I didn't train to complete text though, I was primarily trained to make accurate responses.
And no, writing a response is not "completing text", I don't try to figure out what another person would write as a response, I write what I feel people need to read. That is a completely different thought process. If I tried to mimic what another commenter would have written it would look very different.
codr7
With a plan, aiming for something, that's the difference.
monkpit
This take really misses a key part of implementation of these LLMs and I’ve been struggling to put my finger on it.
In every LLM thread someone chimes in with “it’s just a statistical token predictor”.
I feel this misses the point and I think it dismisses attention heads and transformers, and that’s what sits weird with me every time I see this kind of take.
There _is_ an assumption being made within the model at runtime. Assumption, confusion, uncertainty - one camp might argue that none of these exist in the LLM.
But doesn’t the implementation constantly make assumptions? And what even IS your definition of “assumption” that’s not being met here?
Edit: I guess my point, overall, is: what’s even the purpose of making this distinction anymore? It derails the discussion in a way that’s not insightful or productive.
Jensson
> I feel this misses the point and I think it dismisses attention heads and transformers
Those just make it better at completing the text, but for very common riddles those tools still get easily overruled by pretty simple text completion logic, since the weights for those will be so extremely strong.
The point is that if you understand it's a text completer then it's easy to understand why it fails at these. To fix these properly you need to make it no longer try to complete text, and that is hard to do without breaking it.
moffkalast
Yeah you need specific instruct training for that sort of thing, Claude Opus being one of the rare examples that does such a sensibility check quite often and even admits when it doesn't know something.
These days it's all about confidently bullshitting on benchmarks and overfitting on common riddles to make pointless numbers go up. The more impressive models get on paper, the more rubbish they are in practice.
pants2
Gemini 2.5 is actually pretty good at this. It's the only model ever to tell me "no" to a request in Cursor.
I asked it to add websocket support for my app and it responded like, "Looks like you're using long polling now. That's actually better and simpler. Let's leave it how it is."
I was genuinely amazed.
boznz
I'm surprised nobody mentioned the video interview. I only watched the first 60 seconds and this is the first time I have seen or heard the author, but if I hadn't been told this was AI generated I would have assumed it was genuine and any 'twitching' was the result of the video compression.
myk9001
Letting models interact with systems outside their sandbox brings about some incredible applications. These applications truly seem to have the potential to deeply change entire professions.
All that said, I wonder if GPT4 had been integrated with the same tools, would it've been any less capable?
It sure could give you a search prompt for Google if you asked it to. Back then you had to copy and paste that search prompt yourself. Today o3 can do it on its own. Cool! Does it imply though o3 is any closer to AGI than GPT4?
Models gaining access to external tools, however impressive from all the applications standpoint, feels like lateral movement not a step towards the AGI.
On the other hand, a model remaining isolated in its sandbox while actually learning to reason about that puzzle (assuming it's not present in the training data) would give off those AGI vibes.
joshuanapoli
The newer models are definitely more useful. Back in the GPT 3.5 and 4 days, AutoGPT applied the same types of tools, but you had to be pretty lucky for it to get anywhere. Now Claude 3.7, Gemini 2.5, and GPT o3 make far fewer mistakes, and are better able to get back on track when a mistake is discovered. So they're more convincing as intelligent helpers.
myk9001
Good point. I still wonder if o3 has improved command of tools because it's significantly smarter in general. Or it's "just" trained with a specific focus on using tools better, if that makes sense.
simonw
Coining "Jagged AGI" to work around the fact that nobody agrees on a definition for AGI is a clever piece of writing:
> In some tasks, AI is unreliable. In others, it is superhuman. You could, of course, say the same thing about calculators, but it is also clear that AI is different. It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on. Does that mean that o3 and Gemini 2.5 are AGI? Given the definitional problems, I really don’t know, but I do think they can be credibly seen as a form of “Jagged AGI” - superhuman in enough areas to result in real changes to how we work and live, but also unreliable enough that human expertise is often needed to figure out where AI works and where it doesn’t.
verdverm
Why not call it AJI instead of AGI then?
Certainly jagged does not imply general
It seems to me the bar for "AGI" has been lowered to measuring what tasks it can do rather than the traits we normally associate with general intelligence. People want it to be here so bad they nerf the requirements...
nearbuy
Human intelligence is jagged. You're raising the AGI bar to a point where most people wouldn't qualify as having general intelligence.
My partner and I work in different fields. AI has advanced to the point where there are very few questions I could ask my partner that o3 couldn't answer as well or better.
I can't ask expert level questions in her field, because I'm not an expert in her field, and she couldn't ask expert level questions in my field for the same reason. So when we're communicating with each other, we're mostly at sub-o3 level.
> People want it to be here so bad they nerf the requirements...
People want to claim it's overhyped (and protect their own egos) so badly they raise the requirements...
But really, largely people just have different ideas of what AGI is supposed to mean. It used to vaguely mean "human-level intelligence", which was fine for talking about some theoretical future event. Now we're at a point where that definition is too vague to say whether AI meets it.
verdverm
You're using limited and anecdotal task based metrics as some sort of evidence. Both of you are able to drive a car, yet we need completely different AIs for such tasks.
I still find task-based measures insufficient; there are very basic machines that can perform tasks humans cannot. Should this be a measure of our or their intelligence?
I have another comment in this thread about trait based metrics being a possibly better method.
> People want to claim it's overhyped (and protect their own egos) so badly they raise the requirements...
Shallow response. Seek to elevate the conversation. There are also people who see it for what it is, a useful tool but not intelligent...
tasuki
> You're raising the AGI bar to a point where most people wouldn't qualify as having general intelligence.
We kind of don't? Look how difficult it is for us to just understand some basic math. Us humans mostly have intelligence related to the ancestral environment we developed in, nothing general about that.
I agree with you the term "AGI" is rather void of meaning these days...
bbor
Well I think the point being made is an instrumental one: it’s general enough to matter, so we should use the word “general” to communicate that to laypeople.
Re: "traits we associate with general intelligence", I think the exact issue is that there is no scientific (i.e. specific and consistent) list of such traits. This is why Turing wrote his famous 1950 paper and invoked the Imitation Game; not to detail how one could test for a computer that's really thinking (/truly general), but to show why that question isn't necessary in the first place.
verdverm
I still disagree; being good at a number of tasks does not make it intelligent.
Certainly creativity is missing, it has no internal motivation, and it will answer the same simple question both right and wrong, depending on unknown factors. What if we reverse the framing from "it can do these tasks, therefore it must be..." to "it lacks these traits, therefore it is not yet..."
While I do not disagree that the LLMs have become advanced enough to do a bunch of automation, I do not agree they are intelligent or actually thinking.
I'm with Yann Lecun when he says that we won't reach AGI until we move beyond transformers.
iknowstuff
AJI lol love it.
qsort
I don't think that's a particularly honest line of thinking though. It preempts the obvious counterargument, but very weakly so. Calculators are different, but why? Can an ensemble of a calculator, a Prolog interpreter, Alexnet and Stockfish be considered "jagged superintelligence"? They are all clearly superhuman, and yet require human experience to be wielded effectively.
I'm guilty as charged of having looked at GPT 3.5 and having thought "it's meh", but more than anything this is showing that debating words rather than the underlying capabilities is an empty discussion.
og_kalu
>Calculators are different, but why? Can an ensemble of a calculator, a Prolog interpreter, Alexnet and Stockfish be considered "jagged superintelligence"?
Those are all different things with little to nothing to do with each other. It's like saying: what if I ensemble a snake and a cat? What does that even mean? GPT-N or whatever is a single model that can do many things, no ensembling required. That's the difference between it and a calculator or Stockfish.
AstralStorm
That is not true; the model is modular, thus an ensemble. It uses DALL-E for graphics and specialized tokenizer models for sound.
If you remove those tools, or cut its access to search databases, it becomes quite a bit less capable.
A human would often still manage to do it without some of that data, perhaps with less certainty, while GPT has more problems than that without others filling in the holes.
shrx
>> It is already demonstrating general capabilities and performing a wide range of intellectual tasks, including those that it is not specifically trained on.
Huh? Isn't an LLM's capability fully constrained by the training data? Everything else is hallucinated.
simonw
You can argue that everything output by an LLM is hallucinated, since there's no difference under-the-hood between outputting useful information and outputting hallucinations.
The quality of the LLM then becomes how often it produces useful information. That score has gone up a lot in the past 18 months.
(Sometimes hallucinations are what you want: "Tell me a fun story about a dog learning calculus" is a valid prompt which mostly isn't meant to produce real facts about the world.)
codr7
Isn't it the case that the latest models actually hallucinate more than the ones that came before? Despite best efforts to prevent it.
bbor
The critical discovery was a way to crack the “Frame Problem”, which roughly comes down to colloquial notions of common sense or intuition. For the first time ever, we have models that know if you jump off a stool, you will (likely!) be standing on the ground afterwards.
In that sense, they absolutely know things that aren’t in their training data. You’re correct about factual knowledge, tho — that’s why they’re not trained to optimize it! A database(/pagerank?) solves that problem already.
Y_Y
Here's o4-mini-high solving riddles:
> What weighs more; a pound of courage or a pound coin?
> They weigh the same—each is a “pound.” A pound of courage and a pound coin both come out to one pound in weight!
https://chatgpt.com/share/68053ad0-aae4-800d-95ff-5b5b0a9a38...
I didn't know what AGI is, but I know this isn't it.
hannofcart
Sonnet 3.7 (Thinking) gave me:
:::thinking
This question is playing with words and units.

A "pound of courage" refers to a metaphorical measurement of the abstract concept of courage, using the imperial weight unit "pound" (which equals 453.59 grams).

A "pound coin" refers to the physical currency of the United Kingdom, which is called a "pound" but is a unit of currency, not weight. A physical pound coin weighs about 8-9 grams (specifically, the current 12-sided pound coin weighs 8.75 grams).

So comparing them:
- A pound of courage would weigh 453.59 grams (if courage had physical weight, which it doesn't since it's an abstract concept)
- A pound coin weighs approximately 8.75 grams

The trick in this question is that it's comparing:
1. A specific weight of an abstract concept (pound of courage)
2. A physical coin whose name includes the word "pound" but refers to currency, not weight

Since the question asks what weighs more, and weight is a physical property, I should point out that courage is abstract and doesn't have physical weight, while a pound coin is a physical object with a measurable weight.
:::

# Comparing "A Pound of Courage" vs "A Pound Coin"

This is a play on words with different meanings of "pound":

- A *pound of courage* refers to a metaphorical pound (weight unit) of an abstract concept
- A *pound coin* is a physical British currency coin (which weighs about 8.75 grams)

If we're talking about physical weight:
- A pound of courage doesn't have physical weight since courage is abstract
- A pound coin has a definite weight of approximately 8.75 grams

If we interpreted "pound of courage" as literally one pound (453.59 grams) of something, it would weigh significantly more than the coin.

This question is clever because it compares a metaphorical measurement with a physical object whose name includes the word "pound" but in a different context.
jpc0
Yet for me I will draw analogies the other way; maybe you can prompt the AI into that chain of thought, but in my experience that doesn't happen.
I would look along the lines of the value of a human being based on traits vs payment, which is likely what the original phrase intended.
Is paying someone more better than getting a better candidate but paying them less?
boznz
If I ask a cancer specialist "Do I have Cancer?" I really don't want to prompt them with "can you think a bit harder on that"
pbhjpbhj
Courage is a beer, a kilo of Courage weighs a kilo.
dgs_sgd
While it’s hard to agree on what AGI is I think we can more easily agree on what AGI _is not_.
I don’t consider an AI that fails the surgery brain teaser in the article to be AGI, no matter how superhuman it is at other narrow tasks. It doesn’t satisfy the “G” part of AGI.
skybrian
What’s clear is that AI is unreliable in general and must be tested on specific tasks. That might be human review of a single output or some kind of task-specific evaluation.
It’s bad luck for those of us who want to talk about how good or bad they are in general. Summary statistics aren’t going to tell us much more than a reasonable guess as to whether a new model is worth trying on a task we actually care about.
simonw
Right: we effectively all need our own evals for the tasks that matter to us... but writing those evals continues to be one of the least well documented areas of how to effectively use LLMs.
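For what it's worth, even a tiny harness helps here. The sketch below is hypothetical, not an established tool: `ask_model` is a placeholder for whatever API call you use, and the two cases and the substring check are stand-ins for checks that matter to your own tasks.

```python
# A minimal, personal task-specific eval; ask_model and the cases are placeholders.
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    must_contain: str  # crude pass/fail signal; real evals often need richer scoring

CASES = [
    Case("Extract the ISO date from: 'Invoice issued 3rd of March, 2025'", "2025-03-03"),
    Case("What does HTTP status 418 mean?", "teapot"),
]

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model of choice here")

def run_evals() -> None:
    passed = 0
    for case in CASES:
        answer = ask_model(case.prompt)
        ok = case.must_contain.lower() in answer.lower()
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case.prompt[:50]}")
    print(f"{passed}/{len(CASES)} passed")

if __name__ == "__main__":
    run_evals()
```

Substring checks are crude; the value is mostly in keeping a versioned list of prompts you can re-run whenever a new model ships.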
tomrod
I agree with Professor Mollick that the capabilities in specific task categories are becoming superhuman -- a precursor for AGI.
Until those capabilities are expanded for model self-improvement -- including being able to adapt its own infrastructure, code, storage, etc. -- I think AGI/ASI are yet to be realized. My reference points are SkyNet, Traveler's "The Director", and Person of Interest's "The Machine" and "Samaritan." The ability to target a potentially inscrutable goal, along with the self-agency to direct itself towards it, is true "AGI" in my book. We have a lot of components that we can reason are necessary, but it is unclear to me that we get there in the next few months.
airstrike
I don't think we should take it as a given that these are truly precursors for AGI.
We may be going about it the wrong way entirely and need to backtrack and find a wholly new architecture, in which case current capabilities would predate AGI but not be precursors.
aylmao
> I've always been a staunch defender of capitalism and free markets, even though that's historically been an unpopular opinion in my particular social circle. Watching the LLM market, I can't help but feel extremely vindicated.
> The brutal and bruising competition between the tech giants has left nothing but riches for the average consumer.
Capitalism has always been great at this: creating markets, growing them, producing new goods. It's widely acknowledged amongst people who actually seek to gain an understanding of Marxism and don't just stay in the surface-level, black-and-white "socialism and capitalism are opposites" discourse that's very common in the West, especially in the USA after McCarthy's Red Scare.
The problem is what comes once the market is grown and the only way for owners to keep profits growing is: 1. consolidating into monopolies or cartels, so competition doesn't get in the way of profits, 2. squeezing the working class, looking to pay less for more work, and/or 3. abusing the natural world, to extract more materials or energy for less money. This is evident in plenty of developed industries: from health care, to broadcasting, telecommunications, fashion, etc.
If we view Socialism for what it is, namely a system built to replace Capitalism's bad parts but keep its good parts, China's system, for example, starts to make more sense. Capitalism, in a similar way, was an evolution from Feudalism that replaced its bad parts to achieve greater liberty for everyone (liberty is very much lost as Feudalism matures), which is great for society as a whole. Socialism is meant to be similar, aiming to achieve greater equity, which it views as better for society as a whole.
arrosenberg
Agree with most of what you wrote, but China isn’t capitalist, they’re mercantilist with socialist policies. Capital is heavily constrained under Xi.
simianwords
I thought o1 pro could have solved this riddle
> A young boy who has been in a car accident is rushed to the emergency room. Upon seeing him, the surgeon says, "I can operate on this boy!" How is this possible?
But it didn't!
simonw
Hah, yeah that still catches out o4-mini and o3 too. Amusingly, adding "It's not the riddle." to the end fixes that.
(o4-mini high thought for 52 seconds and even cheated and looked up the answer on Hacker News: https://chatgpt.com/share/68053c9a-51c0-8006-a7fc-75edb734c2...)
The capabilities of AI post gpt3 have become extraordinary and clearly in many cases superhuman.
However (as the article admits) there is still no general agreement of what AGI is, or how we (or even if we can) get there from here.
What there is, is a growing and often naïve excitement that anticipates it as coming into view, and unfortunately that will be accompanied by the hype-merchants desperate to be first to "call it".
This article seems reasonable in some ways but unfortunately falls into the latter category with its title and sloganeering.
"AGI" in the title of any article should be seen as a cautionary flag. On HN - if anywhere - we need to be on the alert for this.