
Exploring the Limits of Large Language Models as Quant Traders

aswegs8

Given that LLMs can't even finish Pokemon Red, why would you expect them to be able to trade futures?

vita7777777

This is very thoughtful and interesting. It's worth noting that this is just a start and in future iterations they're planning to give the LLMs much more to work with (e.g. news feeds). It's somewhat predictable that LLMs did poorly with quantitative data only (prices) but I'm very curious to see how they perform once they can read the news and Twitter sentiment.

Lapsa

I would argue that sentiment classification is where LLMs perform best. Folks are already using them for precisely that purpose, and have even built a public index out of it.
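
A minimal sketch of what such an index could look like, assuming the Hugging Face transformers sentiment pipeline and made-up headlines (not the public index mentioned above):

    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")   # default off-the-shelf model

    headlines = [                                  # made-up examples
        "Fed signals rates will stay higher for longer",
        "Chipmaker beats earnings estimates, raises guidance",
        "Exchange outage halts futures trading for an hour",
    ]

    results = classifier(headlines)

    # Map POSITIVE/NEGATIVE to +1/-1, weight by confidence, average into [-1, 1].
    signed = [(1 if r["label"] == "POSITIVE" else -1) * r["score"] for r in results]
    print(f"toy sentiment index: {sum(signed) / len(signed):+.2f}")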

rob_c

Not only can I guarantee the models are bad with numbers; unless it's a highly tuned and modified version, they're also too slow for this arena. Stick to attention transformers in purpose-built model designs, which have much lower latencies than pre-trained LLMs...

kqr

Super interesting! You can click the "live" link in the header to see how they performed over time. The (geometric) average result at the end seems to be that the LLMs are down 35 % from their initial capital – and they got there in just 96 model-days. That's a daily return of roughly -0.45 %, or a yearly return of about -81 %, i.e. practically wiping out the starting capital.
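
Back-of-the-envelope, in Python (a minimal check using the -35 % over 96 model-days read off the live page above):

    # Compounding a 35 % loss over 96 model-days into daily and yearly rates.
    total_return = -0.35
    days = 96

    daily = (1 + total_return) ** (1 / days) - 1   # ~ -0.45 % per day
    yearly = (1 + daily) ** 365 - 1                # ~ -81 % per year

    print(f"daily: {daily:.2%}, yearly: {yearly:.2%}")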

Although I lack the maths to determine it numerically (depends on volatility etc.), it looks to me as though all six are overbetting and would be ruined in the long run. It would have been interesting to compare against a constant fraction portfolio that maintains 1/6 in each asset, as closely as possible while optimising for fees.
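
A rough sketch of that benchmark, with made-up returns and an assumed fee rate (a real comparison would use the contest's actual fills and fee schedule):

    import numpy as np

    rng = np.random.default_rng(0)
    n_assets, n_days = 6, 96
    returns = rng.normal(0.0, 0.03, size=(n_days, n_assets))  # hypothetical daily returns

    fee_rate = 0.001   # assumed proportional cost on traded notional
    capital = 1.0
    target = np.full(n_assets, 1 / n_assets)
    weights = target.copy()

    for r in returns:
        # Positions drift with the day's returns...
        values = capital * weights * (1 + r)
        capital = values.sum()
        drifted = values / capital
        # ...then rebalance back to 1/6 each, paying fees on one side of the turnover.
        traded = np.abs(drifted - target).sum() * capital
        capital -= fee_rate * traded / 2
        weights = target.copy()

    print(f"constant-fraction benchmark final capital: {capital:.3f}")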

> difficulty executing against self-authored plans as state evolves

This is also what I've found when trying to make LLMs play text adventures. Even when given a fair bit of help in the prompt, they lose track of the overall goal and find some niche corner to explore very patiently, but ultimately fruitlessly.

ezekiel68

You don't actually need nanosecond latency to trade effectively in futures markets, but it does help to be able to evaluate and make decisions in the single-digit-millisecond range. Almost no generative model is able to perform inference at that latency threshold.

A threshold in the single-digit-millisecond range allows rapid detection of price reversals (signaling the need to exit a position with the least loss) in even the most liquid of real futures contracts (not counting rare "flash crash" events).
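
To put rough numbers on that budget: a naive in-memory reversal check over the latest ticks runs in microseconds (synthetic prices below), while general-purpose LLM inference typically takes hundreds of milliseconds or more:

    import time
    from collections import deque

    prices = deque(maxlen=3)

    def reversal(p: deque) -> bool:
        # Flag a reversal: the latest tick breaks the direction of the prior move.
        if len(p) < 3:
            return False
        a, b, c = p
        return (b > a and c < b) or (b < a and c > b)

    start = time.perf_counter()
    for tick in [100.0, 100.5, 100.2, 100.1, 100.4]:   # synthetic ticks
        prices.append(tick)
        if reversal(prices):
            pass  # a real system would decide here whether to exit the position
    print(f"5 ticks evaluated in {(time.perf_counter() - start) * 1000:.3f} ms")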

graemep

From the article:

> The models engage in mid-to-low frequency trading (MLFT), where decisions are spaced by minutes to a few hours, not microseconds. In stark contrast to high-frequency trading, MLFT gets us closer to the question we care about: can a model make good choices with a reasonable amount of time and information?

vita7777777

This is true for some classes of strategies. At the same time there are strategies that can be profitable on longer timeframes. The two worlds are not mutually exclusive.

rob_c

Yes, but LLMs can barely cope with following the steps of a complex software tutorial in order. Why would you reasonably expect them, unprompted, to understand time well enough to trade and turn a profit?

XenophileJKO

I don't think betting on crypto is really playing to the strengths of the models. I think giving them news feeds and setting them loose on some section of the S&P 500 would be a better evaluation.

callamdelaney

The limits of LLMs for systematic trading were and are extremely obvious to anybody with a basic understanding of either field. You may as well be flipping a coin.

rob_c

At least a coin is faster and more reliable.

Havoc

Are language models really the best choice for this?

Seems to me that the outcome would be near-random because they are so poorly suited, which might manifest as:

> We also found that the models were highly sensitive to seemingly trivial prompt changes

baq

they're tools. treat them as tools.

since they're so general, you need to explore if and how you can use them in your domain. guessing 'they're poorly suited' is just that, guessing. in particular:

> We also found that the models were highly sensitive to seemingly trivial prompt changes

this is all but obvious to anyone who has seriously looked at deploying these; that's why there are some very successful startups in the evals space.
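
a minimal sketch of that kind of eval, assuming a hypothetical call_model() wrapper around whatever client you actually use:

    from collections import Counter

    PROMPTS = [
        "Given the last 24 hourly BTC prices {prices}, answer LONG, SHORT or FLAT.",
        "Here are 24 hourly BTC prices: {prices}. Reply with LONG, SHORT or FLAT.",
        "Prices (hourly, BTC): {prices}. One word only: LONG, SHORT or FLAT.",
    ]

    def call_model(prompt: str) -> str:
        # hypothetical stand-in; wire up your actual LLM client here
        raise NotImplementedError

    def prompt_sensitivity(prices: list[float]) -> float:
        # fraction of paraphrases that disagree with the majority answer
        answers = [call_model(p.format(prices=prices)) for p in PROMPTS]
        majority, count = Counter(answers).most_common(1)[0]
        return 1 - count / len(answers)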

rob_c

> guessing 'they're poorly suited' is just that, guessing

I have a really nice bridge to sell you...

This "failure" is just a grab at trying to look "cool" and "innovative" I'd bet. Anyone with a modicum of understanding of the tooling (or hell experience they've been around for a few years now, enough for people to build a feeling for this), knows that this it's not a task for a pre-trained general LLM.

reedf1

You will simply lose trading directly with an LLM. Mapping the dislocation by estimating the percentage of LLM trading bots is useful, though.

jwpapi

Isn’t that what Renaissance Technologies does?

bluecalm

>>LLMs are achieving technical mastery in problem-solving domains on the order of Chess and Go, solving algorithmic puzzles and math proofs competitively in contests such as the ICPC and IMO.

I don't think LLMs are anywhere close to "mastery" in chess or Go. Maybe a nitpick, but the point is that an NN created to be good at trading is likely to outperform LLMs at this task, the same way NNs created specifically to be good at board games vastly outperform LLMs at those games.

lukan

"Maybe a nitpick but the point is that a NN created to be good at trading is likely to outperform LLMs at this task the same way way NNs created specifically to be good at board games vastly outperform LLMs at those games."

Disagree. Go and chess are games with very limited rules. Successful trading, on the other hand, is not so much an arbitrary numbers game; it involves analyzing events in the news happening right now. Agentic LLMs that do this and buy and sell accordingly might succeed here.

(Not what they did here, though

"For the first season, they are not given news or access to the leading “narratives” of the market.")