Exploring the Limits of Large Language Models as Quant Traders

83 points by rzk 7 hours ago

kqr 6 hours ago

Super interesting! You can click the "live" link in the header to see how they performed over time. The (geometric) average result at the end seems to be that the LLMs are down 35 % from their initial capital – and they got there in just 96 model-days. That's a daily return of -0.6 %, or a yearly return of -81 %, i.e. practically wiping out the starting capital.

Although I lack the maths to determine it numerically (depends on volatility etc.), it looks to me as though all six are overbetting and would be ruined in the long run. It would have been interesting to compare against a constant fraction portfolio that maintains 1/6 in each asset, as closely as possible while optimising for fees. (Or even better, Cover's universal portfolio, seeded with joint returns from the recent past.)

I couldn't resist starting to look into it. With no costs and no leverage, the hourly rebalanced portfolio just barely outperforms 4/6 coins in the period: https://i.xkqr.org/cfportfolio-vs-6.png. I suspect costs would eat up many of the benefits of rebalancing at this timescale.

This is not too surprising, given the similiarity of coin returns. The mean pairwise correlation is 0.8, the lowest is 0.68. Not particularly good for diversification returns. https://i.xkqr.org/coinscatter.png

> difficulty executing against self-authored plans as state evolves

This is indeed also what I've found trying to make LLMs play text adventures. Even when given a fair bit of help in the prompt, they lose track of the overall goal and find some niche corner to explore very patiently, but ultimately fruitlessly.

falcor84 5 hours ago

Agreed, and I'd also love to see a baseline of human performance here, both of experienced quant traders and of fresh grads who know the theory but never did this sort of trading and aren't familiar with the crypto futures market.
- spaceman_2020 2 hours ago
  
  As someone who trades crypto semi-professionally, this was one of the toughest trading periods I've ever seen and included a massive liquidation event on 10th of October that wiped out over $20B in capital. Any trader who broke even in this period likely outperformed. I know some very, very good traders who got wiped out on leverage on 10th of October when stop losses didn't trigger and prices plummetted to 2021 levels (still no clarity why).
  BTC also performed abysmally during this period with a sustained chop down from $126k to $90k.
  
  kqr 2 hours ago
  
  Note that 10th of October is before the trading period in this experiment. If anything, autoregression over shorter timescales would suggest entering after 10th of October being a good idea!
fragmede 5 hours ago

> find some niche corner to explore very patiently, but ultimately fruitlessly.
What, so they're better at my hobbies than me? Someone give Claude a 3d printer!

lordnacho 5 hours ago

I was chatting to a friend in the space. This guy is both experienced in trading and LLMs, and has gone all-in on using LLMs to get his day-to-day coding done. Now he's working on the model to end all models, which is a fairly ambitious way to put it, but it throws off some interesting conversations.

You need domain knowledge to get this to work. Things like "we fed the model the market data" are actually non-obvious. There might be more than one way to pre-process the data, and what the model sees will greatly affect what actions it comes up with. You also have to think about corner cases, eg when AlphaZero was applied to StarCraft, they had to give it some restrictions on the action rate, that kind of thing. Otherwise the model gets stuck in an imaginary money fountain.

But yeah, the AI thing hasn't passed by the quant trading community. A lot of things going on with AI trading teams being hired in various shops.

Libidinalecon 2 hours ago

You can vibe code in this space as an individual because practically everything you are going to write is already in the training data.
The big Quant hedge funds have been using machine learning for decades. I took the coursera RL in finance class years ago.
The idea you are going to beat Two Sigma at their own game with tokens is just an absurdity.
Personally, I think any individual on their own that claims they are doing anything in the algorithmic / ML high frequency space is full of shit.
I could talk like I am too and sound really impressive to someone outside the space. That is much different though than actually making money on what you claim you are doing.
It reminds me of an artist friend when I was younger. She was an artist and I quite liked her paintings. She would tell everyone she is an artist. She was also an encyclopedia when it came to anything art related. She wasn't actually selling much art though. She lived off the $10k a month allowance her rich father gave her. She wasn't even being dishonest but when you didn't know the full picture a person would just assume she was living off her art sales.
- lordnacho 16 minutes ago
  
  > Personally, I think any individual on their own that claims they are doing anything in the algorithmic / ML high frequency space is full of shit.
  Well I'm in the space, but I've come across more than one guy who discovered a money making algo, all on their own, with all the right ideas but without the industry standard terms for them.
  All logic would suggest this shouldn't be possible, but what I've seen is what I've seen.
- ta12653421 32 minutes ago
  
  >> Personally, I think any individual on their own that claims they are doing anything in the algorithmic / ML high frequency space is full of shit. <<
  do you want to have a chat by Whatsapp then I can show you quite the opposite! :-) And in my case: Nobody knows, only one friend who is also deep in the stuff; people doing this are usually more quiet, since nobody is interested at all. I have some contacts in academia and shared my ideas with them - none of them said: "this wont work"
  (Disclaimer: 25+y IT experience, 15 of them in finance)
JumpCrisscross 4 hours ago

> There might be more than one way to pre-process the data
I'm honestly more hopeful about AI replacing this process than the core algorithmic component, at least directly. (AI could help write the latter. But it's immediately useful for the former.)
poopiokaka 2 hours ago

[dead]

callamdelaney 6 hours ago

The limits of LLM's for systematic trading were and are extremely obvious to anybody with a basic understanding of either field. You may as well be flipping a coin.

ta12653421 29 minutes ago

In general, I agree - but there is one exception, I think: However you put AI into an stat arb context, I think it may help for trading on a daily base like "tell me where i should enter this morning and exit this evening". (not daytrading throughout the whole day)
But, I havent tested it so far since I do not believe it either :D
kqr 5 hours ago

I agree. Plus it's way too short a timeframe to evaluate any trading activity seriously.
But I still think the experiment is interesting because it gives us insight into how LLMs approach risk management, and what effects on that we can have with prompting.
Saline9515 4 hours ago

So what are the limits, given that you seem knowledgeable about it?
rob_c 5 hours ago

At least a coin is faster and more reliable.
falcor84 5 hours ago

20 years ago NNs were considered toys and it was "extremely obvious" to CS professors that AI can't be made to reliably distinguish between arbitrary photos of cats and dogs. But then in 2007 Microsoft released Asirra as a captcha problem [0], which prompted research, and we had an AI solving it not that long after.
Edit - additional detail: The original Asirra paper from October 2007 claimed "Barring a major advance in machine vision, we expect computers will have no better than a 1/54,000 chance of solving it" [0]. It took Philippe Golle from Palo Alto a bit under a year to get "a classifier which is 82.7% accurate in telling apart the images of cats and dogs used in Asirra" and "solve a 12-image Asirra challenge automatically with probability 10.3%" [1].
Edit 2: History is chock-full of examples of human ingenuity solving problems for very little external gain. And here we have a problem where the incentive is almost literally a money printing machine. I expect progress to be very rapid.
[0] https://www.microsoft.com/en-us/research/publication/asirra-...
[1] https://xenon.stanford.edu/~pgolle/papers/dogcat.pdf
- nl 4 hours ago
  
  The Asirra paper isn't from a ML research group. The statement: "Barring a major advance in machine vision, we expect computers will have no better than a 1/54,000 chance of solving it" is just a statement of fact - it wasn't any forms of prediction.
  If you read the paper you note that they surveyed researchers about the current state of the art ("Based on a survey of machine vision literature and vision ex- perts at Microsoft Research, we believe classification accuracy of better than 60% will be difficult without a significant advance in the state of the art.") and noted what had been achieved as PASCAL 2006 ("The 2006 PASCAL Visual Object Classes Challenge [4] included a competition to identify photos as containing several classes of objects, two of which were Cat and Dog. Although cats and dogs were easily distinguishable from other classes (e.g., “bicycle”), they were frequently confused with each other.)
  I was working in an adjacent field at the time. I think the general feeling was that advances in image recognition were certainly possible, but no one knew how to get above the 90% accuracy level reliably. This was in the day of hand coded (and patented!) feature extractors.
  OTOH, stock market prediction via learning methods has a long history, and plenty of reasons to think that long term prediction is actually impossible. Unlike vision systems there isn't another thing that we can point to to say that "it must be possible" and in this case we are literally trying to predict the future.
  Short term prediction works well in some cases in a statistical sense, but long term isn't something that new technology seems likely to solve.
  
  falcor84 2 hours ago
  
  Maybe I misunderstand, but it seems that there's nothing in your comment that contradicts any aspect of mine.
  Regarding image classification. As I see it, a company like Microsoft surveying researchers about the state of the art and then making a business call to recommend the use of it as a captcha is significantly more meaningful of a prediction than any single paper from an ML research group. My intent was just to demonstrate that it was widely considered to be a significant open problem, which it clearly was. That in turn led to wider interest in solving it, and it was solved soon after - much faster than expected by people I spoke to around that time.
  Regarding stock market prediction, of course I'm not claiming that long term prediction is possible. All I'm saying is that I don't see a reason why quant trading could be used as a captcha - it's as pure a pattern matching task as could be, and if AIs can employ all the context and tooling used by humans, I would expect them to be at least as good as humans within a few years. So my prediction is not the end of quant trading, but rather that much of the work of quants would be overtaken by AIs.
  Obviously a big part of trading at the moment is already being done by AIs, so I'm not making a particularly bold claim here. What I'm predicting (and I don't believe that anyone in the field would actually disagree) is that as tech advances, AIs will be given control of longer trading time horizons, moving from the current focus on HFT to day trading and then to longer term investment decisions. I believe that there will still be humans in the loop for many many years, but that these humans would gradually turn their focus to high level investment strategy rather than individual trades.
  
  nl an hour ago
  
  > making a business call to recommend the use of it as a captcha is significantly more meaningful of a prediction than any single paper from an ML research group.
  That's not what this is. It's a research paper from 3 researchers at MSR.
- lambdaone 4 hours ago
  
  What makes trading such a special case is that as you use new technology to increase the capability of your trading system, other market participants you are trading against will be doing the same; it's a never-ending arms race.
  
  ta12653421 26 minutes ago
  
  Good one! The thing is, you are assuming "perfect/symmetric distribution" of all known/available technologies across all market participants - this far off the reality. Sure: Jane Street et al are on the same level, but the next big buckets are a huge variety of trading shops doing whatever proprietary stuff to get their cut; most of them may be aware of the latest buzz, but just dont deploy it et.
  
  jstanley 3 hours ago
  
  That doesn't mean it doesn't work. That means it does work!
  If other market participants chose not to use something then that would show that it doesn't work.

thisisit an hour ago

LLMs are very good at NLP/classification tasks and weak at calculations and numbers. So, I doubt feeding it numerical data is a good idea.

And if you feeding or harnessing as the blog post puts it in a way that where it reasons things like:

> RSI 7-period: 62.5 (neutral-bullish)

Then it is no better than normal automated trading where the program logic is something along the lines if RSI > 80 then exit. And looking at the reasoning trace that is what the model is doing.

> BTC breaking above consolidation zone with strong momentum. RSI at 62.5 shows room to run, MACD positive at 116.5, price well above EMA20. 4H timeframe showing recovery from oversold (RSI 45.4). Targeting retest of $110k-111k zone. Stop below $106,361 protects against false breakout.

My understanding is that technical trading using EMA/timeframes/RSI/MACD etc is big in crypto community. But to automate it you can simply write python code.

I don't know if this is a good use of LLMs. Seems like an overkill. Better use case might have been to see if it can read sentiments from Twitter or something.

ta12653421 an hour ago

>>But to automate it you can simply write python code.
haha, if it would be that easy, most of them would do this? :-D
The thing is - its fucking complicated and most people will give up far before they enter any level of operational capability.
I've developed such a system for myself and Im running it in production (though, not with crypto): And whilte most people will see the complexity in "whatever trading magic you apply", its QUITE the opposite:
- the trading logic itself is simple, its ~ 300 lines
- whats not simple is the part of everything else in the context of "asset management", you need position tracking, state management (orders and positions and account etc.), you need to be able to pour in whatever new quotedata for whatever new assete you identify, the system needs to be stable to work in "mass mode" and be super robust as data provider quality is volatile; you need some type of accounting logic on your side; you need a very capable reporting engine (imagine managing 200 positions simultaneously), I could enlength this list more or less unlimited.
There is MUCH MORE in such an application than the question of "when and how do I trade" - my systems raw source is around 2 MB by today, 3rd party libs and OSS libs not included.
- thisisit 24 minutes ago
  
  You seem to be debating a point which was never made by holding on to one word - simple. I didn't say trading code is simple neither I did say that your trading code setup is simple.
  Still let me clarify - the trading logic as you say is simple and just 300 lines. That is what LLMs seem to be doing in part in the post. The point I made is that doesn't seem to be a good use case for LLMs given that everything costs token. IMO, you could run this in your complex application without spending that much money on tokens.
  If you can explain why original opinion of wasting tokens on something which can "simply" be done in python is wrong, I am all ears.

binsquare 2 hours ago

Today it's clear that there are limitations to LLM's.

But I also see this incredible growth curve to LLM's improvement. 2 years ago, I wouldn't expect llm's to one shot a web application or help me debug obscure bugs and 2 years later I've been proven wrong.

I completely believe that trading is going to be saturated with ai traders in the future. And being able to predict and detect ai trading patterns is going to be an important leverage for human traders if they'll still exist

thunky 2 hours ago

> I completely believe that trading is going to be saturated with ai traders in the future
That's probably good news for us index fund investors. We need people to believe they're going to beat the market.
ta12653421 23 minutes ago

..though even until lately, none of them could tell me how to fix the Azure bug I have with my account: It does not allow me to spin up new machines and shouts an obscure error message :-D

XenophileJKO 6 hours ago

I don't think betting on crypto is really playing to the strengths of the models. I think giving news feeds and setting it on some section of the S&P 500 would be a better evaluation.

aswegs8 6 hours ago

Given that LLMs can't even finish Pokemon Red, how would you expect they are able to trade futures?

falcor84 5 hours ago

(Unless you're a marketer) It makes a lot more sense to build a benchmark before the capabilities are there.
Saline9515 4 hours ago

Because trading is mainly number-based, unlike Pokemon Red?
- terminalbraid 2 hours ago
  
  I'll bite: What part of the game, which is encoded entirely by a finite set of numbers, takes input as numbers, provides output as numbers, and is processed by a CPU that acts in a discrete digital space, cannot be represented by numbers?
wild_pointer 5 hours ago

Hey! That wasn't easy!
agentifysh 5 hours ago

i always felt that emotions, instincts, fear, greed, courage, pain are elements of a self-aware conscious loop system that can't be replicated accurately in a digital system and that a seasoned successful traders realize and utilize that the activity is largely is a psychological one. I'm not talking about neutral plays where you can absorb market fluctuations in the short term to extract 1~2% a week but directional trades that almost all traders play (regardless of how what exotic option strategies they are employing).
also the other curious nature of the markets is its ability to destroy any persistent trading system by reverting to its core stochastic properties and its constant ebb and flow from stability to instability that crescendos into systematic instability that rewrite the rules all over again.
ive tried all sorts of ways to do this and without being a large institution and being able to absorb the noise for neutral or legal quasi insider trading via proximity, for the average joe the emotional/psychological hardness you need to survive and be in the <1% of traders is simply too much, its not unlike any other sports or arts, many dream the dream but only few get interviewed and written about.
rather i think to myself the best trade is the simplest one: buy shares or invest in a business with money or time (strongly recommend against using this unless you have no other means) and sell it at a higher price or maintain a long term DCF from a business you own as leverage/collateral to arbitrage whatever rate your central bank sets on assets in demand or will be in demand.
to me its clear where LLM fits and doesn't but ultimately it cannot, will not, must not replace your own agency.

DivingForGold 2 hours ago

. . . "The (geometric) average result at the end seems to be that the LLMs are down 35 % from their initial capital – and they got there in just 96 model-days. That's a daily return of -0.6 %, or a yearly return of -81 %, i.e. practically wiping out the starting capital."

Proves that LLM's are nowhere near close to AGI.

sd9 2 hours ago

The vast majority of intelligent humans cannot profitably trade on intraday timeframes

ezekiel68 6 hours ago

You don't actually need nanosecond latency to trade effectively in futures markets but it does help to be able to evaluate and make decisions in the single-digit milliseconds range. Almost no generative model is able to perform inference at this latency threshold.

A threshold in the single-digit milliseconds range allows the rapid detection of price reversals (signaling the need to exit a position with least loss) in even the most liquid of real futures contracts (not counting rare "flash crash" events).

graemep 5 hours ago

From the article:
> The models engage in mid-to-low frequency trading (MLFT) trading, where decisions are spaced by minutes to a few hours, not microseconds. In stark contrast to high-frequency trading, MLFT gets us closer to the question we care about: can a model make good choices with a reasonable amount of time and information?
vita7777777 6 hours ago

This is true for some classes of strategies. At the same time there are strategies that can be profitable on longer timeframes. The two worlds are not mutually exclusive.
- rob_c 5 hours ago
  
  Yes, but LLM can barely cope with following the ordering of complex software tutorials linearly. Why would you reasonably expect them unprompted to understand time any better enough to trade and turn a profit?
  
  vita7777777 2 hours ago
  
  My comment makes no such claim. I wrote about different timeframes that trading strategies operate on.

Havoc 6 hours ago

Are language models really the best choice for this?

Seems to me that the outcome would be near random because they are so poorly suited. Which might manifest as

> We also found that the models were highly sensitive to seemingly trivial prompt changes

kqr 5 hours ago

No, LLMs are not a good choice for this – as the results show! If I had to guess, they're experimenting with LLMs for publicity.
- Libidinalecon 2 hours ago
  
  Exactly. This is a performance by a really bad method actor.
baq 6 hours ago

they're tools. treat them as tools.
since they're so general, you need to explore if and how you can use them in your domain. guessing 'they're poorly suited' is just that, guessing. in particular:
> We also found that the models were highly sensitive to seemingly trivial prompt changes
this is as much as obvious for anyone who seriously looked at deploying these, that's why there are some very successful startups in the evals space.
- rob_c 5 hours ago
  
  > guessing 'they're poorly suited' is just that, guessing
  I have a really nice bridge to sell you...
  This "failure" is just a grab at trying to look "cool" and "innovative" I'd bet. Anyone with a modicum of understanding of the tooling (or hell experience they've been around for a few years now, enough for people to build a feeling for this), knows that this it's not a task for a pre-trained general LLM.
  
  baq 2 hours ago
  
  I think you have a different idea of what I'm saying than what I'm actually saying.

spaceman_2020 2 hours ago

Hyperliquid now has select tokenized equities as well. Would love to see how these models perform when trading equities

I've been following these for a while and many of the trades taken by DeepSeek and Qwen were really solid

vita7777777 6 hours ago

This is very thoughtful and interesting. It's worth noting that this is just a start and in future iterations they're planning to give the LLMs much more to work with (e.g. news feeds). It's somewhat predictable that LLMs did poorly with quantitative data only (prices) but I'm very curious to see how they perform once they can read the news and Twitter sentiment.

Lapsa 5 hours ago

I would argue that sentiment classification is where LLMs perform best. folks are already using it for precisely such purpose - have even built a public index out of it
- ritonlajoie 4 hours ago
  
  what index ?
rob_c 5 hours ago

Not just can i guarantee the models are bad with numbers, unless it's a highly tuned and modified version they're too slow for this arena. Stick to using attention transformers in better model designs which have much lower latencies than pre-trained llms...

IAmGraydon 2 hours ago

Crazy how people continue to treat LLMs like they’re anything more than a record of past human knowledge and are then surprised when they can’t predict the future.

bluecalm 6 hours ago

>>LLMs are achieving technical mastery in problem-solving domains on the order of Chess and Go, solving algorithmic puzzles and math proofs competitively in contests such as the ICPC and IMO.

I don't think LLMs are anywhere close to "mastery" in chess or go. Maybe a nitpick but the point is that a NN created to be good at trading is likely to outperform LLMs at this task the same way way NNs created specifically to be good at board games vastly outperform LLMs at those games.

lukan 5 hours ago

"Maybe a nitpick but the point is that a NN created to be good at trading is likely to outperform LLMs at this task the same way way NNs created specifically to be good at board games vastly outperform LLMs at those games."
Disagree. Go and chess are games with very limited rules. Succesful trading on the other hand is not so much a arbitary numbers game, but involves analyzing events in the news happening right now. Agentic LLMs that do this and accordingly buy and sell might succeed here.
(Not what they did here, though
"For the first season, they are not given news or access to the leading “narratives” of the market.")

chronic740202 4 hours ago

Even ChatGPT knows why LLMs for quant trading would never work.

p1dda 5 hours ago

LLM's can do language but not much else, not poker, not trading and definitely no intelligence

Drakim an hour ago

Language is powerful.
Language can do poker, trading, and other intelligent activities.

Edvinyo 5 hours ago

Cool experiment, but it’s nothing more than a random walk.

reedf1 6 hours ago

you simply will lose trading directly with an llm. mapping the dislocation by estimating the percentage of llm trading bots is useful though.

jwpapi 6 hours ago

Isn’t that what Renaissance Technology does?

chronic740202 4 hours ago

> Isn’t that what Renaissance Technology does?
No.
- ta12653421 20 minutes ago
  
  ++1

lvl155 5 hours ago

At the end of the day it all comes down to input data. There are a lot of things you can do to collect proprietary data to give you an edge.

GaryNumanVevo 2 hours ago

That's funny because that advice is _directly_ counter to what most HFT quants say
- lvl155 2 hours ago
  
  Right, because they will tell you exactly how they generate alpha for all the world to see. It’s worth mentioning quant is not all HFT.