Why don't LLMs ask for calculators?
76 comments
February 17, 2025 · luma
eternityforest
But they only know about specific tools via few-shot or zero-shot methods, whereas modern humans are highly trained to always route math through a calculator; it's "in the fine-tune," not "in the prompt."
It doesn't use up any "context" to remember that calculators exist.
Python code is a partial exception: models know how to solve problems with code, and even 1.5B models do amazing things if you prompt them with something like "Use the Z3 solver".
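For concreteness, here's the kind of thing that prompt elicits — a minimal sketch using the z3-solver Python package (the constraint system is illustrative, not from the thread):

from z3 import Int, Solver, sat

# Two unknowns and two linear constraints; Z3 does the actual math.
x, y = Int("x"), Int("y")
s = Solver()
s.add(x + y == 10, x - y == 2)

if s.check() == sat:
    m = s.model()
    print(m[x], m[y])  # prints: 6 4

Either way, the point stands: the arithmetic is offloaded to a solver rather than done token by token.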
IanCal
But if you give them a search, they'll use it to look for tools, and if you give them the ability, they can write new tools for themselves.
roger_
Basic math problems should be treated as a special case rather than general tool use. Why use Python when the computer evaluating the LLM can do the math itself?
mhh__
Because AI programmers only know Python (that's the path dependency / the real reason).
luma
It does do the math itself: it creates the code to do so and then executes that code to get the answer.
mystified5016
The same reason you don't use your desk calculator to run Python scripts on huge datasets. It's the wrong tool.
Running an LLM and parsing and computing mathematical expressions are entirely disjoint operations. You need highly specialized code for each, it makes just as much sense to put a calculator in your LLM as it does to stuff a Python interpreter in a calculator. Could you? Of course, software is infinitely flexible. Does it make sense to do it? No, it makes more sense to connect two different specialized applications than to try shoehorning one into the other.
13years
You still can't get the 100% reliability that would be necessary for certain problem domains.
There is going to be some level of hallucination error in the translation to the agent or code. If it is a complex problem, those errors will compound.
zeroxfe
You can't get 100% reliability from a human either.
13years
Which is why humans use calculators. That is the key point, secondary to the reliability: the LLM "knows" it is bad at math. It knows the purpose of calculators. However, it doesn't use this information to inform the user.
It could also propose to the user that it could write code to compute the answer. It doesn't do that either.
xg15
The author addresses this and argues that it misses the point:
> Now, some might interject here and say we could, of course, train the LLM to ask for a calculator. However, that would not make them intelligent. Humans require no training at all for calculators, as they are such intuitive instruments. We use them simply because we have understanding for the capability they provide.
So the real question behind the headline is why LLMs don't learn to ask for a calculator by themselves, if both the definition of a calculator and the fact that LLMs are bad at math are part of the training data.
luma
I’d contest the statement that humans don’t need to be trained to use a calculator. It certainly isn’t instinctive behavior.
neom
I have dyscalculia and I still have no clue about calculators, except that I was taught how to make one give me the answer to math problems. I'm a bit embarrassed to say that even now I sometimes take a few seconds to boot into being able to use one. We often discuss LLMs as if there were no divergence in humans; I don't know how many people math is intuitive for, but I know plenty of people like me.
13years
I first used a calculator as a kid. Took about 30 seconds. Never had instruction or training. We aren't talking about scientific calculators.
LeifCarrotson
I do think it's interesting to consider why the LLM needs to be told to ask for a calculator and when to do that. And not just in individual prompts where humans manually ask it to "write some code to find the answer", but in general.
We often use the colloquial definition of training to mean something to the effect of taking an input, attempting an output, and being told whether that output was right or wrong. LLMs extend that to taking a character or syllable token as input, doing some computation, predicting the next token(s), and seeing if that was right or wrong. I'd expect the training data to have enough content to memorize single-digit multiplication, but I'd expect it to also learn that this model doesn't work for multiplying an 11-digit number by a 14-digit number.
The "use a calculator" concept and "look it up in a table" concepts were taught to the LLM too late and it didn't internalize that as a way to perform better.
recursive
> Humans require no training at all for calculators, as they are such intuitive instruments
I don't think that's even true though. If you think this, I would suggest you've just internalized your training on the subject.
IanCal
> So the real question behind the headline is why LLMs don't learn to ask for a calculator by themselves
They can. They're sometimes a bit cocky about their maths abilities, but this really isn't hard to test or show.
https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...
They can also create tools that can be useful.
crackrook
This still doesn't get at the point, with this example you've effectively constructed a prompt along the lines of: "Note: A calculator is available upon request wink, here's how you'd use it: ... Now, what's the eighth root of 4819387574?"
Of course the model will use the calculator you've explicitly informed it of. The article is meant to be a critique of claims that LLMs are "intelligent" when, despite knowing their math limitations, they don't generally answer "You'd be better off punching this into a calculator" when asked a problem.
amelius
Yes, but the takeaway message is that even laypeople can invent how LLMs should work.
colonCapitalDee
Claude Sonnet 3.5 will often use JavaScript as a calculator. It's not perfect when it comes to deciding whether it should write code, but that's easy to fix by prompting it with "Write some code to help you answer the question".
The post is honestly quite strange. "When LLMs try and do math themselves they often get it wrong" and "LLMs don't use tools" are two entirely different claims! The first claim is true, the second claim is false, and yet the article uses the truth of the first claim as evidence for the second! This does not hold up at all.
enragedcacti
The claim isn't "LLMs don't use tools", the author is saying that LLMs can't make reliable inferences regarding their own knowledge or capabilities which fundamentally limits their usefulness for many tasks. LLMs "know" that LLMs can't do math reliably, LLMs "know" that calculators can do math reliably, and yet LLMs generally just soldier on and try to do math themselves when asked. You can of course RL it or prompt it into writing javascript when it sees math but so far LLMs haven't been capable of generalizing the process of "I am bad at X" + "Thing is good at X" -> "I should ask for Thing to do X" unless that specific chain of thought is common in the training data.
The solution so far has just been to throw more RL or carefully crafted synthetic data at it, but it's arguably more Pavlovian than it is generalized learning.
Someone could teach a dog to ring a bell that says "food" on it, and you could reasonably argue that it is using a tool. Will it then know to ring a bell that says "walk" when it wants to go outside?
IanCal
I gave sonnet a hard arithmetic problem, and the ability to look for tools. It looked for a calculator, I gave it one and it used that.
https://gist.github.com/IanCal/2a92debee11a5d72d62119d72b965...
enragedcacti
The availability of tools and what they're named is going to influence its behavior. Gemini 2.0 Pro can obviously get this question right on its own, but the existence of a find_tool() option causes it to use it. Sorry it's scuffed, I just did it on my phone to make the point, but I'd imagine you could get similar results with the tools param, as all it's doing is putting the tool options into the context.
You are an advanced AI assistant that has a number of tools available to you. in order to use a tool, respond with "USE TOOL: <tool_name>(tool_parameter)".
Tools:
select_tool(<tool_name>)
find_tool(<search_term>)
Who stars in The Godfather?
> USE TOOL: find_tool("The Godfather cast")
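A harness on the other end of that convention needs only a few lines to parse and dispatch the call. A minimal sketch (the tool registry and its contents are hypothetical, just mirroring the names in the prompt above):

import re

# Hypothetical tool registry mirroring the names used in the prompt above.
TOOLS = {
    "find_tool": lambda term: f"found 1 tool matching '{term}'",
    "select_tool": lambda name: f"selected '{name}'",
}

def dispatch(llm_output):
    """Parse 'USE TOOL: <name>("<arg>")' and run the matching tool."""
    m = re.match(r'USE TOOL:\s*(\w+)\("?(.*?)"?\)\s*$', llm_output.strip())
    if m is None:
        return None  # not a tool call; treat as a normal reply to the user
    name, arg = m.groups()
    if name not in TOOLS:
        return f"error: unknown tool '{name}'"
    return TOOLS[name](arg)  # result goes back into the model's context

print(dispatch('USE TOOL: find_tool("The Godfather cast")'))
# found 1 tool matching 'The Godfather cast'

The dispatch itself is trivial; whether the model thinks to call find_tool at all is the behavior under debate.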
dr_dshiv
The post is honestly very normal — many mid-level intellectuals think this way and love these takedowns of LLMs. I don't understand it for the life of me.
exe34
> A simple observation of LLM behavior can tell us much more than might be apparent at first.
I love the "I'm an expert, I ask simple questions and reveal profound truths" vibe.
PaulHoule
Many LLMs, particularly coding assistants, use "tools". Here is one with a calculator:
https://githubnext.com/projects/gpt4-with-calc/
and another example
https://www.pinecone.io/learn/series/langchain/langchain-too...
LLMs often do a good job at mathy coding. For instance, I told Copilot "i want a python function that computes the collatz sequence for a given starting n and returns it as a list"
def collatz_sequence(n):
sequence = [n]
while n != 1:
if n % 2 == 0:
n = n // 2
else:
n = 3 * n + 1
sequence.append(n)
return sequence
which gives right answers, which I wouldn't count on copilot being able to do on its own.
lblume
Especially for a problem as well-known as this, expect code for it to have been seen by the model during training at least a thousand times, in different languages, flavors, etc. This, on its own, is nothing new.
PaulHoule
I wouldn't trust it to win a fight with the borrow checker in Rust, but lots of simple and practical cases work, such as "use comprehensions to give me a sum of the squares of all elements divisible by three":
sum(x**2 for x in numbers if x % 3 == 0)
and to do the same for a pandas series with pandas operators, after asking it to inline something: (numbers[numbers % 3 == 0] ** 2).sum()
It's not a miracle; you have to go at it with some critical thinking and testing, and it makes mistakes, but so does Stack Overflow.
lblume
In the first case, I would literally type out the expression faster than thinking about expressing it in natural language.
13years
That wasn't the point. The multiplication errors allow us to peer into the lack of reasoning across all domains.
Yes, you can add tools. But hallucinations will still be there. Tools allow you to cut down on the steps the LLM has to perform. However, if you have a complex problem with many steps, there will be translation errors at some point coordinating the tools.
Furthermore, if there is some other tool needed to get the result you need, the LLM isn't going to tell you. It will typically make up the result.
PaulHoule
Sure, but people make mistakes too. There is no point in wasting an LLM's capacity to multiply.
Try doing coding with Cursor or Windsurf, those use tools all the time. Windsurf sometimes has trouble for me on Windows because it wants to write paths like
/c:/something/or/other
and it will try to run its tool, get an error, ask me for help, I'll tell it "you're running on Windows and you can't write / before the c: and you should write \ instead of /" and it does better. I just asked copilot to multiply
839162704321847925107309452196847230165937402194385627409536218 * 582930174682375093104627481695
and the first thing it did was write out the expression. I told it I wanted the integer, and it gave the same answer Python gives, which is 489173261817269091475894827953471001727389372345981246974410480760096492908180614917234529510
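For what it's worth, the check is a one-liner, since Python integers are arbitrary precision (a sketch; run it rather than trusting the digits above):

a = 839162704321847925107309452196847230165937402194385627409536218
b = 582930174682375093104627481695
print(a * b)  # exact: Python ints never overflow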
13years
> There is no point in wasting an LLM's capacity to multiply
Agreed. Again, that is not the issue. It is that the LLM does not know it is a waste of time. That is apparent to you as you have intelligence. It is not apparent to the LLM. It is not intelligent.
xigoi
The point is that the LLM cannot figure out on its own that it needs to use the tool, you have to explicitly ask for it.
Sysreq2
A lot of people are talking about tool use and writing internal scripts, and yeah, that’s kind of an answer. Really though I think the author is highlighting that LLMs are not being used efficiently at the present moment.
LLMs are great at certain tasks. Databases are better at certain tasks. Calculators too. While we could continually throw more and more compute at the problem, growing layers and injecting more data, wouldn't it make more sense to just have an LLM call its own back-end calculator agent? When I ask it for obscure information, maybe it should just pull from its own internal encyclopedia database.
Let LLMs do what they do well, but let’s not forget the decades that brought us here. Even the smartest human still uses a calculator, so why doesn’t an AI? The fact that it writes its own JavaScript is flashy as hell but also completely unnecessary and error prone.
13years
> Really though I think the author is highlighting that LLMs are not being used efficiently at the present moment.
Yes, that is a key point. It isn't to say they are useless tools, but that they aren't intelligent tools and that has significant meaning for what tasks we think they are appropriate for.
Unfortunately, nearly everyone has misinterpreted the intent as showing LLMs can't use tools. The point is about how LLMs work differently than most think that they do.
crancher
> Now, some might interject here and say we could, of course, train the LLM to ask for a calculator. However, that would not make them intelligent. Humans require no training at all for calculators, as they are such intuitive instruments.
Does the author really believe humans are born with an innate knowledge of calculators and their use?
aithrowawaycomm
I think he means that you don't need to train a human to understand that a calculator is useful, and in particular to recognize when a problem is hard enough that you need to bust out a calculator. That sort of logic is self-evident to humans but hard to consistently evoke in LLMs.
That said, I was using simple +*-/ calculators as a small child and I don't think I needed to be taught anything other than MC/MR. The tool is intuitive if you are familiar with formal written arithmetic (of course hunter-gatherers couldn't make sense of it).
MyOutfitIsVague
I remember having several days of lessons on how to use calculators in elementary, middle, and high school. They are "intuitive", but not all of their functions are, and if you rely fully on intuition, you might expect them to do things like respect order of operations, which they very well might not.
13years
As a kid of about 5 or 6 years old I used my first calculator with no instruction whatsoever. We are not talking about scientific calculators. Addition, Multiplication. It does not require training or instruction, just a minute of exploration.
luma
You did, however, need to be taught math first: you needed to learn how to pick things up, read numbers, interact with buttons, understand that a device might have an on and off state, and a zillion other things. It took about 5 or 6 years of training time to make that happen, and it was the result of parents, teachers, or others actively taking time to train you. That process didn't involve parking you in a library at birth so you could just go figure it out.
The author is simply being obtuse and presumably has some axe to grind, or is just ignorant of how LLMs are trained. For example, LLMs don't learn to chat from the data; they have to be instruct-tuned to make that happen. Every LLM chatbot you've ever used had this extra training step. Further, this is the exact same training process that can also train for tool use.
Trying to say “this should just happen from the data” is silly, it isn’t how any of this works. It’s not how you learned things, and it’s not how LLMs-as-chatbots work.
13years
I'm the author.
> Trying to say “this should just happen from the data” is silly, it isn’t how any of this works. It’s not how you learned things, and it’s not how LLMs-as-chatbots work.
Yes, that was the entire point of the article. We are in agreement.
Workaccount2
I don't know what happened, but there was that time when GPT-4 could access Wolfram Alpha, and anytime you asked it something beyond the most basic math, it would automatically prompt Wolfram for the answer.
threatripper
That was during the great plugin times. Haven't seen plugins for a while. Do they still exist?
Terr_
> The LLM has no self-reflection for the knowledge it knows and has no understanding of concepts beyond what can be assembled by patterns in language.
My favorite framing: the LLM is just an ego-less extender of text documents. It is being iteratively run against a movie script, which is usually incomplete and ends in: "User Says X, and Bot responds with..."
Designers of these systems have--deliberately--tricked consumers into thinking they are talking to the LLM author, rather than supplying mad-libs dialogue for a User character that is in the same fictional room as a Bot character.
The Bot can only speak limitations which are story-appropriate for the character. It only says it's bad at math because lots of people have written lots of words saying the same thing. If you changed its name and description to Mathematician Dracula, it would have dialogue about how it's awesome at math but can't handle sunlight, crucifixes, or garlic.
This framing also explains how "prompt injection" and "hallucinations" are not exceptional, but standard core behavior.
DrNosferatu
I’m surprised that LLMs don’t have in their system prompt a hard rule that any numeric computations in particular, and any other computations in general, must be performed only via tool use / running Python.
sega_sai
I am puzzled by the fact that modern LLMs don't do multiplication the way humans do it, i.e. digit by digit. Surely they can write an algorithm for that, but why can't they perform it?
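For reference, the digit-by-digit algorithm in question is short to write out; a minimal Python sketch of schoolbook multiplication over digit strings:

def schoolbook_multiply(a, b):
    """Multiply two non-negative integers given as digit strings,
    the way humans do it: digit by digit, with carries."""
    result = [0] * (len(a) + len(b))
    for i, da in enumerate(reversed(a)):
        carry = 0
        for j, db in enumerate(reversed(b)):
            total = result[i + j] + int(da) * int(db) + carry
            result[i + j] = total % 10
            carry = total // 10
        result[i + len(b)] += carry
    digits = "".join(map(str, reversed(result))).lstrip("0")
    return digits or "0"

print(schoolbook_multiply("127", "45"))  # 5715

An LLM can readily emit exactly this code; what it struggles with is executing the loop reliably inside its own forward pass, which is the puzzle here.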
threeducks
Two reasons:
1. LLMs "think" in terms of tokens, which usually are around 4 characters each. While humans only have to memorize 10x10 multiplication tables to perform multiplication of large numbers, LLMs have to memorize a 10000x10000 table, which is much more difficult.
2. LLMs can't "think in their head", so you have to make them spell out each step of the multiplication, just like (most) humans can't multiply huge numbers without intermediate steps.
A simple way to demonstrate this is to ask an LLM for the birth year of a celebrity and then whether that number is even or odd. The answer will be correct almost every time. But if you ask whether the birth year of a celebrity is even or odd and forbid spelling out the year, the accuracy will be barely above 50 %.
threatripper
Can't we always tokenize numbers as single digits and give the LLM a <thinking> scratchpad invisible to the user?
threeducks
Yes, but we could also give the LLM access to a Python interpreter and solve a much larger class of problems with correctness guarantees and around a billion times less compute.
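One common shape for that (a sketch, not any particular vendor's implementation): pull a fenced code block out of the model's reply, run it in a subprocess, and feed the printed output back into the context.

import re
import subprocess
import sys

def run_model_code(llm_output, timeout=10):
    """Extract a ```python ...``` block from a model reply and execute it."""
    m = re.search(r"```python\n(.*?)```", llm_output, re.DOTALL)
    if m is None:
        return "no code block found"
    # NOTE: a real harness would sandbox this; running model code verbatim is unsafe.
    proc = subprocess.run(
        [sys.executable, "-c", m.group(1)],
        capture_output=True, text=True, timeout=timeout,
    )
    return proc.stdout or proc.stderr  # fed back to the model as context

print(run_model_code("```python\nprint(2**64)\n```"))  # 18446744073709551616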
bobro
Are there a lot of examples written out of people talking through running that algorithm? I’d guess not.
karparov
And as you see in the responses here, most people miss the point, elect to patch over the aspects in which the lack of intelligence is glaring, and eventually the end product will be so hard to distinguish from actual intelligence that it's deemed "good enough".
Is that bad? Idk. If you hoped that real AGI would eventually solve humanity's biggest problems and questions, perhaps so. But if you want something that really, really looks like AGI except to some nerds who still say "well, actually", then it's gonna be good enough for most. And certainly sufficient for ending up in the dystopia from that movie clip at the end.
lern_too_spel
If you don't give a human a calculator, they do the calculation in their head. The same with an LLM. If you give a human a calculator, they will use it. The same with an LLM. Both will say a calculator can do arithmetic better.
I don't believe current LLMs are AGIs but this article's argument is a poor one.
scarface_74
The paid version of ChatGPT has had a built in Python runtime for well over a year.
The [>_] links to the Python code that was run.
https://chatgpt.com/share/67b79516-9918-8010-897c-ba061a2984...
lblume
The free version, at least once you are logged in, does as well. I don't pay OpenAI, and for the prompt
"Calculate the difference between the two biggest primes less than the factorial of 20"
it wrote the following code:
import sympy
# Calculate 20!
factorial_20 = sympy.factorial(20)
# Find the two largest primes less than 20!
largest_prime = sympy.prevprime(factorial_20)
second_largest_prime = sympy.prevprime(largest_prime)
# Calculate the difference
difference = largest_prime - second_largest_prime
difference
executed it, and produced the correct result, 40.
They can do exactly that; it's called tool use, and nearly all modern models can handle it. For example, I have a consumer GPU that can run an R1 Qwen distill, which, when prompted for a large multiplication, will elect to write a Python script to find the answer.
This is a table-stakes feature for even the open/free models today.