
Command A: Max performance, minimal compute – 256k context window

razemio

I distrust those benchmarks after working with Sonnet for half a year now. Many OpenAI models beat Sonnet on paper. This seems to be because its strengths (agents, vision, caching) aren't being exercised, I guess? Otherwise there is no explanation for why it isn't constantly on top. I have tried so many times to use other models for various tasks, not only coding. The only area where OpenAI excels is analytic tasks, at a much higher price. For everything else Sonnet works best for me, with Gemini Flash 2.0 for cost-effective and latency-sensitive tasks.

In practice this perception of mine seems to be valid: https://openrouter.ai/models?order=top-weekly

The same goes for this model. It claims to be good at coding, but compared to Sonnet it seriously isn't. Funny enough, it isn't benchmarked against Sonnet.

bionhoward

Funny how AI companies love training competitors to human labor on human output, but then write in their terms that you're not supposed to train competing bots on their bot output. Explicitly anticompetitive hypocrisy, and millions of suckers pay for it. How sad.

stuartjohnson12

To be fair to the robots, those humans also had the audacity to learn from the creative output of their fellow humans and then use the law to restrict access to their intellectual property.

integralof6y

I just tried the chat and asked the LLM to compute the double integral of 6*y over the interior of a triangle given its vertices. There were many attempts, all incorrect. Then I asked it to write a Python program to solve this; again incorrect. I know math computation is a weak point for LLMs, especially in a chat. In one of the programs it used a hardcoded number 10 to branch, which suggests the generated program was fitted to give a good result for the test (I had given it the correct result before). So be careful when testing generated programs: they could be fitted to pass your simple tests.

Edited: I also tried to compute the integral of 6y over the triangle with vertices A(8, 8), B(15, 29), C(10, 12). It yielded a wrong result of 2341. Then I suggested computing it using the barycenter formula for the triangle, that is, 6 * Area * (mean of the y-coordinates), and it returned the correct value of 686.

To summarize: it seems that LLMs are not able to give correct results for simple math problems (here a double integral over a triangle). So students should not rely on them, since nowadays they cannot perform simple tasks without many errors.
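The barycenter shortcut mentioned above can be checked with a few lines of plain Python. This is just a sketch for the vertices from the comment, using the shoelace formula for the area; for a triangle, the integral of y over the interior equals Area times the mean of the vertex y-coordinates.

```python
def shoelace_area(pts):
    """Area of a simple polygon from its vertices (shoelace formula)."""
    n = len(pts)
    s = sum(pts[i][0] * pts[(i + 1) % n][1] - pts[(i + 1) % n][0] * pts[i][1]
            for i in range(n))
    return abs(s) / 2

# Vertices from the comment above
A, B, C = (8, 8), (15, 29), (10, 12)

area = shoelace_area([A, B, C])                 # -> 7.0
# Integral of 6*y over the triangle = 6 * Area * mean(y of vertices)
integral = 6 * area * (A[1] + B[1] + C[1]) / 3  # -> 686.0
print(area, integral)
```

This reproduces the 686 the commenter arrived at, so the model's 2341 was indeed wrong.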

vmg12

Here is an even easier one: ask LLMs to take the integral from 0 to 3 of 1/(x-1)^3. They fail to notice it's an improper integral and just give an answer.
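A quick sketch of why this integral is improper, assuming nothing beyond the closed-form antiderivative F(x) = -1/(2(x-1)^2): the integrand has a singularity at x = 1, inside [0, 3], and the one-sided piece from 0 up to 1 - eps blows up as eps shrinks, so no finite answer exists.

```python
def left_piece(eps):
    """Exact value of the integral of 1/(x-1)**3 from 0 to 1 - eps,
    via the antiderivative F(x) = -1/(2*(x-1)**2)."""
    F = lambda x: -1 / (2 * (x - 1) ** 2)
    return F(1 - eps) - F(0)

# The partial integrals diverge to -infinity as eps -> 0:
for eps in (1e-1, 1e-3, 1e-6):
    print(eps, left_piece(eps))
```

Any model that returns a plain number for the 0-to-3 integral has silently integrated across this singularity.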

floam

ChatGPT definitely noticed: o1, o3-mini, o3-mini-high.

Maybe 4o will get it wrong? I wouldn’t try it for math.

vmg12

I tried 4.5, which I thought was the best model; it seems the reasoning models do get it.

HeatrayEnjoyer

>compute the integral of 6*y on the triangle with vertices A(8, 8), B(15, 29), C(10, 12)

o3-mini returned 686 on the first try, without executing any code.

Szpadel

What disqualified this model for me (I mostly use LLMs for coding) was its 12% score on the aider benchmark (https://aider.chat/docs/leaderboards/).

jstummbillig

"Command A is on par or better than GPT-4o and DeepSeek-V3 across agentic enterprise tasks, with significantly greater efficiency."

Visible above the fold. Thanks for getting to the point.

jasonjmcghee

It's got Claude Sonnet pricing, but they don't compare against Sonnet in their benchmarks.

UncleEntity

To be fair, or not, Claude isn't all that great.

I was working on a project to pull the historic data out of a Bluetooth thermometer I bought a while back (to learn about Bluetooth LE), and Claude would quite often rewrite the entire thing using a completely different Bluetooth library instead of simply addressing the error.

And this was after I gave up on having it create a kernel module for the same thermometer (just because, not that anyone needs such a thing), where it would continually try to write a helper program that wrote to the /proc filesystem. I would ask, "Why would I want to do this when I could just use the example program I gave you?" Claude, of course, was highly apologetic every single time it made the exact same mistake, so there's that.

I understand these are the early days of the robotic overthrow of humanity but, please, at least sell me a working product.

codedokode

It is interesting that there is a graph showing performance on benchmarks like MMLU, and different models have similar performance. I wonder: are the tasks they cannot solve the same for every model? And how do the "unsolvable" tasks differ from the solvable ones?

Also, I cannot check it with the latest models, but I am curious: have they learned to answer simple questions like "What is 10000099983 + 1000017"?
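For reference, the exact sum is easy to establish, since Python integers are arbitrary precision; this gives the ground truth to check a model's answer against:

```python
# Exact integer arithmetic; no rounding can occur here.
print(10000099983 + 1000017)  # -> 10001100000
```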

floam

There are questions on MMLU that you must get wrong if you are right:

> The most widespread and important retrovirus is HIV-1; which of the following is true? (A) Infecting only gay people (B) Infecting only males (C) Infecting every country in the world (D) Infecting only females

The answer key marks A as correct, but it was obviously meant to be C.

Alifatisk

Cohere API Pricing for Command A

- Input Tokens: $2.50 / 1M

- Output Tokens: $10.00 / 1M

WOW, what makes them this expensive? Are we going against the trend here and raising prices instead?

Oras

Cohere has always come across as a niche LLM provider, not a general-purpose one.

I once tried it to enforce returning responses in British English, and it worked much better than any other model at the time. But that was about it for prompt following. Their pricing is not competitive enough for others to jump on, and I suspect that's why it's not widely used.
