An LLM is a poor computational/predictive paradigm for playing chess.
This just in: a hammer makes a poor screwdriver.
LLMs are more like a leaf blower though
Actually, a very specific model (gpt-3.5-turbo-instruct) was pretty good at chess (around 1700 Elo, if I remember correctly).
I’m impressed, if that’s true! In general, an LLM’s training cost vs. an LSTM, RNN, or some other more appropriate DNN algorithm suitable for the ruleset is laughably high.
Oh yes, the cost of training is of course a great loss here; it’s not optimized at all, and it’s stuck at an average level.
Interestingly, I believe some people did research on it and found parameters in the model that seem to represent the state of the chess board (as in, they reflect the current state of the board, and when artificially modified, the model takes the modification into account in its play). A French YouTuber used it to show how LLMs can somehow have a kind of representation of the world. I can try to get the sources back if you’re interested.
Absolutely interested. Thank you for your time to share that.
My career path in neural networks began with research on cancerous tissue object detection in medical diagnostic imaging. Now it has switched to generative models for CAD (architecture, product design, game assets, etc.). I don’t really mess about with fine-tuning LLMs.
However, I do self-host my own LLMs as code assistants. Thus, I’m only tangentially involved with the current LLM craze.
But it does interest me, nonetheless!
Here is the main blog post that I remembered: it has a follow-up, a more scientific version, and uses two other articles as a basis, so you might want to dig around in what they mention in the introduction.
It is indeed a quite technical discovery, and it still lacks complete and wider analysis, but it is very interesting because it somewhat invalidates the common gut feeling that LLMs are pure random luck.
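For context, the technique behind that kind of discovery is essentially a “linear probe”: a small classifier trained on the LLM’s internal activations to read out the board state. Here’s a minimal sketch of the idea; all the names, shapes, and dimensions are illustrative assumptions, not taken from the actual papers:

```python
# Hypothetical sketch of the "board-state probe" idea: train a linear
# classifier on a transformer's hidden activations to predict the piece
# occupying each square. Shapes here are assumptions for illustration.
import torch
import torch.nn as nn

HIDDEN_DIM = 768       # assumed width of the LLM's residual stream
NUM_SQUARES = 64
NUM_PIECE_STATES = 13  # 6 white pieces + 6 black pieces + empty

# One linear readout covering all 64 squares at once.
probe = nn.Linear(HIDDEN_DIM, NUM_SQUARES * NUM_PIECE_STATES)

def probe_loss(hidden_state, board_labels):
    """hidden_state: (batch, HIDDEN_DIM) activations captured after a move token.
    board_labels: (batch, 64) integer piece codes for the true board."""
    logits = probe(hidden_state).view(-1, NUM_SQUARES, NUM_PIECE_STATES)
    return nn.functional.cross_entropy(
        logits.reshape(-1, NUM_PIECE_STATES), board_labels.reshape(-1)
    )
```

The intervention part of the research then amounts to nudging the hidden state in the direction the probe associates with a different board and watching the model’s play change accordingly.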
The underlying neural network tech is the same as what the best chess AIs (AlphaZero, Leela) use. The problem is, as you said, that ChatGPT is designed specifically as an LLM so it’s been optimized strictly to write semi-coherent text first, and then any problem solving beyond that is ancillary. Which should say a lot about how inconsistent ChatGPT is at solving problems, given that it’s not actually optimized for any specific use cases.
Yes, I agree wholeheartedly with your clarification.
My career path, as I stated in a different comment regarding neural networks, is focused on generative DNNs for CAD applications and parametric 3D modeling. Before that, I began as a researcher in cancerous tissue classification and object detection in medical diagnostic imaging.
Thus, large language models are well out of my area of expertise in terms of the architecture of their models.
However, fundamentally it boils down to the fact that the specific large language model used was designed to predict text and not necessarily solve problems/play games to “win”/“survive”.
(I admit that I’m just parroting what you stated and maybe rehashing what I stated even before that, but I like repeating and refining in simple terms to practice explaining to laymen and, dare I say, clients. It helps me feel as if I don’t come off too pompously when talking about this subject to others; forgive my tedium.)
Yeah, a lot of them hallucinate illegal moves.
Sometimes it seems like most of these AI articles are written by AIs with bad prompts.
Human journalists would hopefully do a little research. A quick search would reveal that researchers have been publishing about this for over a year, so there’s no need to sensationalize it. Perhaps the human journalist could have spent a little time talking about why LLMs are bad at chess and how researchers are approaching the problem.
LLMs, on the other hand, are very good at producing clickbait articles with low information content.
GothamChess has a video of making ChatGPT play chess against Stockfish. Spoiler: ChatGPT does not do well. It plays okay for a few moves but then the moment it gets in trouble it straight up cheats. Telling it to follow the rules of chess doesn’t help.
This sort of gets to the heart of LLM-based “AI”. That one example to me really shows that there’s no actual reasoning happening inside. It’s producing answers that statistically look like answers that might be given based on that input.
For some things it even works. But calling this intelligence is dubious at best.
Because it doesn’t have any understanding of the rules of chess or even an internal model of the game state, it just has the text of chess games in its training data and can reproduce the notation, but nothing to prevent it from making illegal moves, trying to move or capture pieces that don’t exist, incorrectly declaring check/checkmate, or any number of nonsensical things.
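The usual scaffolding fix is exactly that missing piece: validate every move the model emits against a real rules engine and retry on garbage. A minimal sketch with the python-chess library (ask_llm_for_move is a hypothetical stand-in for the LLM call):

```python
# Minimal scaffolding sketch: let a rules engine (python-chess) reject
# illegal LLM moves instead of trusting the model's notation.
# ask_llm_for_move() is a hypothetical stand-in for an LLM call.
import chess

def play_validated_move(board: chess.Board, max_retries: int = 5) -> chess.Move:
    for _ in range(max_retries):
        san = ask_llm_for_move(board.fen())  # e.g. "Nf3"
        try:
            move = board.parse_san(san)      # raises on illegal/garbled moves
        except ValueError:
            continue                         # illegal move: ask again
        board.push(move)
        return move
    raise RuntimeError("LLM produced no legal move")
```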
Hallucinating 100% of the time 👌
ChatGPT versus DeepSeek is hilarious. They both cheat like crazy and then one side Jedi mind tricks the winner into losing.
So they are both masters of troll chess then?
See: King of the Bridge
It plays okay for a few moves but then the moment it gets in trouble it straight up cheats.
Lol. More comparisons to how AI is currently like a young child.
I think the biggest problem is its very low “test-time adaptability”. Even when combined with a reasoning model outputting into its context, the weights themselves do not learn from the immediate context.
I think the solution might be to train a LoRA overlay on the fly against the weights, run inference with that AND the unmodified weights, and then have an overseer model self-evaluate and recompose the raw outputs.
Like humans are way better at answering stuff when it’s a collaboration of more than one person. I suspect the same is true of LLMs.
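To make that speculation concrete, here’s a rough sketch of what the LoRA-overlay part could look like with Hugging Face’s peft library; the on-the-fly training loop and the “overseer” recombination step are stubbed out, since those are the speculative bits:

```python
# Rough sketch of the on-the-fly LoRA idea, using Hugging Face's peft
# library. The "overseer" comparison step is speculative and stubbed.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in base model
tok = AutoTokenizer.from_pretrained("gpt2")

# Wrap the frozen base weights with a small trainable low-rank adapter.
model = get_peft_model(base, LoraConfig(
    task_type="CAUSAL_LM", r=8, lora_alpha=16, target_modules=["c_attn"]))

# ... briefly fine-tune `model` on the immediate context here ...

prompt = tok("What is a leaf?", return_tensors="pt")

# Inference WITH the adapter applied:
adapted_out = model.generate(**prompt, max_new_tokens=50)

# Inference with the adapter temporarily disabled (unmodified weights):
with model.disable_adapter():
    base_out = model.generate(**prompt, max_new_tokens=50)

# A second "overseer" model would compare/recompose the two outputs here.
```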
Like humans are way better at answering stuff when it’s a collaboration of more than one person. I suspect the same is true of LLMs.
It is.
It’s really common for non-language implementations of neural networks. If you have an NN that’s right some percentage of the time, you can often run it through a bunch of copies of the NNs and take the average and that average is correct a higher percentage of the time.
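A toy illustration of that effect, assuming the copies’ errors are roughly independent (the numbers are made up for the demo):

```python
# Toy illustration of ensemble averaging: several noisy copies of a
# classifier, each individually ~80% accurate, vote their way to a
# higher combined accuracy (assuming their errors are independent).
import numpy as np

rng = np.random.default_rng(0)
true_labels = rng.integers(0, 2, size=10_000)

def noisy_model(labels, accuracy=0.8):
    flip = rng.random(labels.shape) > accuracy
    return np.where(flip, 1 - labels, labels)

single = noisy_model(true_labels)
print("single model:", (single == true_labels).mean())   # ~0.80

ensemble = np.mean([noisy_model(true_labels) for _ in range(9)], axis=0)
majority = (ensemble > 0.5).astype(int)                   # majority vote
print("9-model vote:", (majority == true_labels).mean())  # ~0.95+
```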
Aider is an open source AI coding assistant that lets you use one model to plan the coding and a second one to do the actual coding. It works better than doing it in a single pass, even if you assign the same model to planning and coding.
In this case it’s not even bad prompts, it’s a problem domain ChatGPT wasn’t designed to be good at. It’s like saying modern medicine is clearly bullshit because a doctor loses a basketball game.
I imagine the “author” did something like, “Search https://scholar.google.com/, find a publication where AI failed at something, and write a paragraph about it.”
It’s not even as bad as the article claims.
Atari isn’t great at chess. https://chess.stackexchange.com/questions/24952/how-strong-is-each-level-of-atari-2600s-video-chess
Random LLMs were nearly as good 2 years ago. https://lmsys.org/blog/2023-05-03-arena/
LLMs that are actually trained for chess have done much better. https://arxiv.org/abs/2501.17186
Wouldn’t surprise me if an LLM trained on records of chess moves made good chess moves. I just wouldn’t expect the deployed version of ChatGPT to generate coherent chess moves based on the general text it’s been trained on.
I wouldn’t either but that’s exactly what lmsys.org found.
That blog post had ratings between 858 and 1169. Those are slightly higher than the average rating of human users on popular chess sites. Their latest leaderboard shows them doing even better.
https://lmarena.ai/leaderboard has one of the Gemini models with a rating of 1470. That’s pretty good.
I swear every single article critical of current LLMs is like, “The square got BLASTED by the triangle shape when it completely FAILED to go through the triangle shaped hole.”
It’s newsworthy when the sellers of squares are saying that nobody will ever need a triangle again, and the shape-sector of the stock market is hysterically pumping money into companies that make or use squares.
It’s also from a company claiming they’re getting closer to creating a morphing shape that can match any hole.
And yet the company offers no explanation for how, exactly, they’re going to get wood to do that.
The press release where OpenAI said we’d never need chess players again
You get 2 triangles in a single square mate…
CHECKMATE!
Touchdown! 3 points!
Well, the first and obvious thing to do to show that AI is bad is to show that AI is bad. If it provides that much of a low-hanging fruit for the demonstration… that just further emphasizes the point.
That’s just clickbait in general these days lol
Ah, you used logic. That’s the issue. They don’t do that.
2025 Mazda MX-5 Miata ‘got absolutely wrecked’ by Inflatable Boat in beginner’s boat racing match — Mazda’s newest model bamboozled by 1930s technology.
Hardly surprising. LLMs aren’t -thinking-, they’re just shitting out the next token for any given input of tokens.
That’s exactly what thinking is, though.
An LLM is an ordered series of parameterized/weighted nodes which are fed a bunch of tokens; millions of calculations later, it generates the next token to append, and the process repeats. It’s like turning a handle on some complex Babbage-esque machine. LLMs use a tiny bit of randomness (“temperature”) when choosing the next token, so the responses are not identical each time.
But it is not thinking. Not even remotely so. It’s a simulacrum. If you want to see this, run ollama with the temperature set to 0, e.g.:
```
ollama run gemma3:4b
>>> /set parameter temperature 0
>>> what is a leaf
```
You will get the same answer every single time.
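For anyone curious, “temperature” is just a rescaling of the model’s output logits before sampling. Here’s a minimal sketch showing why temperature 0 gives the same answer every single time:

```python
# Minimal sketch of temperature sampling over next-token logits.
# As temperature -> 0 the sampling collapses to argmax, so generation
# becomes fully deterministic, as in the ollama example above.
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    if temperature == 0:
        return int(np.argmax(logits))          # greedy: same token every time
    scaled = np.asarray(logits) / temperature  # higher T flattens the distribution
    probs = np.exp(scaled - scaled.max())      # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5]
print([sample_next_token(logits, 0.0) for _ in range(5)])  # always [0, 0, 0, 0, 0]
print([sample_next_token(logits, 1.0) for _ in range(5)])  # varies run to run
```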
I know what an LLM is doing. You don’t know what your brain is doing.
Using an LLM as a chess engine is like using a power tool as a table leg. Pretty funny honestly, but it’s obviously not going to be good at it, at least not without scaffolding.
is like using a power tool as a table leg.
Then again, our corporate lords and masters are trying to replace all manner of skilled workers with those same LLM “AI” tools.
And clearly that will backfire on them and they’ll eventually scramble to find people with the needed skills, but in the meantime tons of people will have lost their source of income.
If you believe LLMs are not good at anything then there should be relatively little to worry about in the long-term, but I am more concerned.
It’s not obvious to me that it will backfire for them, because I believe LLMs are good at some things (that is, when they are used correctly, for the correct tasks). Currently they’re being applied to far more use cases than they are likely to be good at – either because they’re overhyped or our corporate lords and masters are just experimenting to find out what they’re good at and what not. Some of these cases will be like chess, but others will be like code*.
(* not saying LLMs are good at code in general, but for some coding applications I believe they are vastly more efficient than humans, even if a human expert can currently write higher-quality less-buggy code.)
I believe LLMs are good at some things
The problem is that they’re being used for all the things, including a large number of tasks that they are not well suited to.
yeah, we agree on this point. In the short term it’s a disaster. In the long-term, assuming AI’s capabilities don’t continue to improve at the rate they have been, our corporate overlords will only replace people for whom it’s actually worth it to them to replace with AI.
All these comments asking “why don’t they just have ChatGPT go and look up the correct answer”.
That’s not how it works, you buffoons, it trains off of datasets long before it releases. It doesn’t think. It doesn’t learn after release, it won’t remember things you try to teach it.
Really lowering my faith in humanity when even the AI skeptics don’t understand that it generates statistical representations of an answer based on answers given in the past.
If you don’t play chess, the Atari is probably going to beat you as well.
LLMs are only good at things to the extent that they have been well-trained in the relevant areas. Not just learning to predict text string sequences, but reinforcement learning after that, where a human or some other agent says “this answer is better than that one” enough times in enough of the right contexts. It mimics the way humans learn, which is through repeated and diverse exposure.
If they set up a system to train it against some chess program, or (much simpler) simply gave it a tool call, it would do much better. Tool calling already exists and would be by far the easiest way.
It could also be instructed to write a chess solver program and then run it, at which point it would be on par with the Atari, but it wouldn’t compete well with a serious chess solver.
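For what it’s worth, the backend of such a tool call could be as small as this sketch using python-chess (assuming a local Stockfish binary on the PATH; the LLM-side plumbing is omitted):

```python
# Sketch of the tool the LLM would call: the model hands over a FEN
# string and the engine, not the LLM, picks the move. Assumes a local
# Stockfish binary on the PATH.
import chess
import chess.engine

def best_move_tool(fen: str, think_time: float = 0.1) -> str:
    """Tool-call target: return the engine's best move for a position."""
    board = chess.Board(fen)
    engine = chess.engine.SimpleEngine.popen_uci("stockfish")
    try:
        result = engine.play(board, chess.engine.Limit(time=think_time))
    finally:
        engine.quit()
    return board.san(result.move)

# The LLM's only job is to route the position here and relay the answer:
print(best_move_tool(chess.Board().fen()))
```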
Can i fistfight ChatGPT next? I bet I could kick its ass, too :p
this is because an LLM is not made for playing chess
Is anyone actually surprised at that?
So, it fares as well as the average schmuck, proving it is human
/s
LLMs useless, confirmed once again
Isn’t the Atari just a game console, not a chess engine?
Like, Wikipedia doesn’t mention anything about the Atari 2600 having a built-in chess engine.
If they were willing to run a chess game on the Atari 2600, why did they not apply the same to ChatGPT? There are custom GPTs which claim to use a stockfish API or play at a similar level.
Like this, it’s just unfair. Both platforms are not designed to deal with the task by themselves, but one of them is given the necessary tooling, the other one isn’t. No matter what you think of ChatGPT, that’s not a fair comparison.
Edit: Given the existing replies and downvotes, I think this comment is being misunderstood. I would like to try clarifying again what I meant here.
First of all, I’d like to ask if this article is satire. That’s the only way I can understand the replies I’ve gotten that criticized me on grounds of the marketing aspect of LLMs (when the article never brings up that topic itself, nor did I). Like, if this article is just some tongue-in-cheek type thing about holding LLMs to the standards they’re advertised at, I can understand both the article and the replies I’ve gotten. But the article never suggests so itself. So my assumption when writing my comment was that this is not the case and it is serious.
The Atari is hardware. It can’t play chess on its own. To be able to, you need a game for it which is inserted. Then the Atari can interface with the cartridge and play the game.
ChatGPT is an LLM. Guess what, it also can’t play chess on its own. It also needs to interface with a third party tool that enables it to play chess.
Neither the Atari nor ChatGPT can directly, on their own, play chess. This was my core point.
I merely pointed out that it’s unfair that one party in this comparison is given the tool it needs (the cartridge), but the other party isn’t. Unless this is satire, I don’t see how marketing plays a role here at all.
GPTs which claim to use a stockfish API
Then the actual chess isn’t the LLM. If you’re using Stockfish, then the LLM doesn’t add anything; Stockfish is doing everything.
The whole point of the marketing rage is that LLMs can do all kinds of stuff, doubling down on this with the branding of some approaches as “reasoning” models, which are roughly “similar to ‘pre-reasoning’, but forcing use of more tokens on disposable intermediate generation steps”. With this facet of LLM marketing, the promise would be that the LLM can “reason” itself through a chess game without particular enablement. In practice, people trying to feed gobs of chess data into an LLM end up with an LLM that doesn’t even comply with the rules of the game, let alone provide reasonable competitive responses to an opponent.
Then the actual chess isn’t the LLM.
And neither did the Atari 2600 win against ChatGPT. Whatever game they ran on it did.
That’s my point here. The fact that neither Atari 2600 nor ChatGPT are capable of playing chess on their own. They can only do so if you provide them with the necessary tools. Which applies to both of them. Yet only one of them was given those tools here.
Fine: a chess engine capable of running on 1970s electronics that were affordable even at the time will best what marketing folks would have you think is an arbitrarily capable “reasoning” model running on top-of-the-line 2025 hardware.
You can split hairs about “well actually, the 2600 is hardware and a chess engine is the software” but everyone gets the point.
As to assertions that no one should expect an LLM to be a chess engine, well tell that to the industry that is asserting the LLMs are now “reasoning” and provides a basis to replace most of the labor pool. We need stories like this to calibrate expectations in a way common people can understand…
The Atari 2600 is just hardware. The software came on plug-in cartridges. Video Chess was released for it in 1979.