This is a bad test for reasoning. The model is either trained on the spelling of its tokens or specific words, or not. It's not a thing reasoning would help with in any way. (Ok, maybe accidentally if it tried different splits it's trained to spell, but that's not the reasoning part either)
It's still not quite perfect at letter counting, but it handles questions like "which is heavier: 2 kg of feathers or 1 kg of lead?" really well. The reasoning steps seem to prevent it from jumping to pattern-matched answers.
Large models have no issues with this question at all. Even llama-70B can handle it without issues, and that is a lot smaller than GPT-4o. But for small models this is a challenging question: llama-8B gets it confidently wrong 4 out of 5 times, and gemma-2-9B gets it wrong pretty much every time. qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to reliably get this question right, which is why I find it noteworthy that this model does.
1 Problem Understanding
The task is to count the number of 'r's in the word 'strawberry'. This requires identifying each occurrence of the letter 'r' in the given word and keeping a count.
2 Data Gathering
The word to analyze is 'strawberry'. I need to examine each character in this word.
3 Hypothesis Formation
Based on a quick glance, I hypothesize that there are 2 'r's in 'strawberry', but I need to verify this systematically.
4 Analysis
I will go through each letter of 'strawberry' one by one: s-t-r-a-w-b-e-r-r-y
5 Calculation
Counting the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 9. Total count: 2
6 Verification
Double-checking: s(0)-t(0)-r(1)-a(1)-w(1)-b(1)-e(1)-r(2)-r(2)-y(2). Confirmed, there are 2 'r's.
7 Conclusion Drawing
The final answer is that there are 2 'r's in the word 'strawberry'.
8 Implementation
To present the answer, I will state: 'There are 2 r's in the word "strawberry".' This directly answers the question in a clear and concise manner.
I know that this is a typical test of an LLM's ability to reason, but I wonder how much time could be saved by teaching an LLM to recognise the types of problem it's inherently bad at, bundling a Python interpreter, and asking it to write a short function to solve the problem. Are we not pushing a boulder uphill?
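A minimal sketch of the kind of snippet such a tool call could emit (the function name is hypothetical; any one-liner over the raw string would do):

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter by scanning characters one at a time."""
    return sum(1 for ch in word if ch == letter)

# The interpreter operates on raw characters, so tokenization never enters the picture.
print(count_letter("strawberry", "r"))  # -> 3
```

Handing the problem to an interpreter turns a task LLMs are structurally bad at into one that is trivially exact.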
> teaching an LLM how to recognise the type of problem that it's inherently bad at
Solving this is the actual hard part, and it is either adjacent to or even equivalent to solving the problem of LLMs hallucinating. ChatGPT already includes a Python interpreter tool, which can be used when the context indicates it's appropriate.
I suppose my question is pointing to another, which is: can one make an LLM that doesn't hallucinate? Isn't that problem inherent to the way that LLMs work? Obviously we can try to clean the data so there isn't any nonsense fed into it, but that'll only get you so far with a probabilistic, stochastic system. As an LLM once told me "Some experts argue that hallucination is an innate limitation of LLMs, akin to confabulation in humans, where false memories are created without the intention to deceive". I'm not sure if I believe that though.
RAG and fine-tuning improve domain specificity and may reduce the problem to a level where you don't care, but it will always be there.
Clean data would help reduce the incidence, possibly to a level that is more usable, but it also doesn't remove the problem.
Considering next-token prediction as serial runs on multi-tape TMs, with the previous output as the input, can help.
Especially if you consider Microsoft's BitNet b1.58, which requires full precision for training but can reduce weights to just their sign components (plus zero) for inference.
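For reference, a rough sketch of the ternary quantization BitNet b1.58 describes, reduced to its core idea (absmean scaling, then rounding to {-1, 0, 1}); this is a simplification for illustration, not the paper's implementation:

```python
def ternary_quantize(weights: list[float]) -> list[int]:
    """Round weights to {-1, 0, 1} after scaling by the mean absolute value,
    loosely following BitNet b1.58's absmean quantization."""
    gamma = sum(abs(w) for w in weights) / len(weights) + 1e-8
    return [max(-1, min(1, round(w / gamma))) for w in weights]

print(ternary_quantize([0.9, -0.8, 0.05, 0.0]))  # -> [1, -1, 0, 0]
```

Large-magnitude weights keep only their sign; small ones collapse to zero, which is what makes cheap inference possible.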
Unfortunately, all the paths I have to explaining this require graduate-level complexity theory and/or differential geometry. Or you relive the Brouwer–Hilbert controversy by trying the logic path.
I found the following transcript of the thought process for that question equally enlightening and funny:
Thought for 4 seconds
1 problem understanding
The task is to count the number of instances of the letter 'r' in the word 'strawerberry'. This requires identifying each occurrence of 'r' in the given word and keeping a count.
2 data gathering
The word to analyze is 'strawerberry'. I need to examine each character in this word.
3 hypothesis formation
Based on a quick glance, I hypothesize that there are 2 instances of 'r' in 'strawerberry'.
4 analysis
Let's break down the word into individual characters: s, t, r, a, w, e, r, b, e, r, r, y
5 calculation
Now, let's count the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 7, 3) Third 'r' at position 10. Total count: 3
6 verification
Double-checking the count: s(0), t(0), r(1), a(0), w(0), e(0), r(2), b(0), e(0), r(3), r(3), y(0). Confirmed: 3 instances of 'r'.
7 error checking
The initial hypothesis was incorrect. The actual count (3) differs from the initial guess (2). This highlights the importance of careful analysis rather than quick assumptions.
8 conclusion drawing
The letter 'r' appears 3 times in the word 'strawerberry'.
9 implementation
The final response should be the number 3, as this is the correct count of 'r' instances in 'strawerberry'.
>Alright, I need to figure out how many 'r's are in the word 'strawerberry'. Let's look at it carefully. I'll write it down: s-t-r-a-w-e-r-b-e-r-r-y. Now, I'll go through each letter one by one.
>Starting from the beginning: s, t, r – that's one 'r'. Then a, w, e, r – that's the second 'r'. Next, b, e, r – that's the third 'r'. And finally, another r before the y, making it the fourth 'r'. So, I count four 'r's in total.
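The running tally the transcripts attempt is trivial to do exactly in code. A short sketch of that same verification step, applied to the misspelled 'strawerberry':

```python
def running_count(word: str, letter: str) -> list[int]:
    """Cumulative count of `letter` after each character, mirroring the
    transcripts' letter-by-letter verification step."""
    counts, total = [], 0
    for ch in word:
        total += ch == letter
        counts.append(total)
    return counts

tally = running_count("strawerberry", "r")
print(tally[-1])  # -> 4, matching the second transcript, not the first
```

The exact scan agrees with the second transcript's answer of four and shows where the first transcript's tally went wrong.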
Tokenization, still. LLMs don’t see individual letters. It’s like using Google Translate to ask a German speaker the same question and concluding that Germans can’t count letters.
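A toy illustration of the point; the subword split and IDs below are entirely hypothetical (real BPE vocabularies are learned from data and differ), but the effect is the same: the model receives opaque token IDs, not letters.

```python
# Hypothetical subword vocabulary; real tokenizers learn these pieces from a corpus.
vocab = {"str": 312, "aw": 675, "berry": 19772}

def toy_tokenize(word: str) -> list[int]:
    """Greedy longest-match split of `word` against the toy vocabulary."""
    ids, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                ids.append(vocab[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token for {word[i:]!r}")
    return ids

# The model sees [312, 675, 19772] -- the letters inside "berry" are
# invisible unless their spelling was memorized during training.
print(toy_tokenize("strawberry"))
```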
Well, the model is based on llama-8b, which is quite bad at reasoning. Reasoning (or things that look and quack like reasoning) is more the domain of 70B+ models, and some of the newer 7B models.
The model is doing well on many reasoning tasks; what they are doing is a massive step up from llama-8b. But it still makes some silly mistakes. I bet if you did the same finetuning procedure with qwen-7B or llama-70B as a starting point, you would get a quite competent model.
Is there something more than prediction going on? I'd like to understand why there would be studies like this: https://arxiv.org/pdf/2307.02477 if it's all just prediction. Are there some algorithms that are called "reasoning", that are somewhere between prediction and full-blown reasoning?
My guess is the framing in that paper was chosen to be provocative. The authors surely understand deep neural networks neither reason nor recite as we understand it in humans.
In any case, the conclusion was decidedly against reasoning.