
Not bad, asked it to count Rs in "strawberry" and Ns in "international", it answered correctly, and it was fast.


This is a bad test for reasoning. The model either was trained on the spelling of its tokens (or of specific words) or it wasn't; it's not something reasoning would help with in any way. (OK, maybe incidentally, if it tried different splits it happens to know how to spell, but that's not the reasoning part either.)


It's still not quite perfect at letter counting, but it handles questions like "which is heavier: 2kg of feathers or 1kg of lead?" really well. The reasoning steps seem to prevent it from jumping to pattern-matched answers.


I don't get why this question is relevant for evaluating reasoning capacity. GPT-4o (no reasoning, in the anthropomorphic sense) answers correctly:

--- The reasoning lies in the concept of mass and weight. The weight of an object is determined by its mass, not its material.

1. Mass comparison:

2kg of feathers has a mass of 2 kilograms.

1kg of lead has a mass of 1 kilogram.

Since 2 kilograms is greater than 1 kilogram, the feathers are heavier.

2. Irrelevance of material:

The type of material (feathers or lead) does not change the mass measurement.

Lead is denser than feathers, so 1kg of lead takes up much less space than 2kg of feathers, but the weight is still based on the total mass.

Thus, 2kg of any substance, even something as light as feathers, is heavier than 1kg of a dense material like lead.


Large models have no issues with this question at all. Even llama-70B can handle it without issues, and that is a lot smaller than GPT-4o. But for small models this is a challenging question. llama-8B gets it confidently wrong 4 out of 5 times. gemma-2-9B gets it wrong pretty much every time. qwen-coder-7B can handle it, so it's not impossible. It's just uncommon for small models to reliably get this question right, which is why I find it noteworthy that this model does.


Yes, that makes sense. I didn't take the model size into account, and now that you mention it, it makes a lot of sense.


Mine was wrong

1 Problem Understanding

The task is to count the number of 'r's in the word 'strawberry'. This requires identifying each occurrence of the letter 'r' in the given word and keeping a count.

2 Data Gathering

The word to analyze is 'strawberry'. I need to examine each character in this word.

3 Hypothesis Formation

Based on a quick glance, I hypothesize that there are 2 'r's in 'strawberry', but I need to verify this systematically.

4 Analysis

I will go through each letter of 'strawberry' one by one: s-t-r-a-w-b-e-r-r-y

5 Calculation

Counting the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 9. Total count: 2

6 Verification

Double-checking: s(0)-t(0)-r(1)-a(1)-w(1)-b(1)-e(1)-r(2)-r(2)-y(2). Confirmed, there are 2 'r's.

7 Conclusion Drawing

The final answer is that there are 2 'r's in the word 'strawberry'.

8 Implementation

To present the answer, I will state: 'There are 2 r's in the word "strawberry".' This directly answers the question in a clear and concise manner.


I gave it a trick question. It started down the right path and then went bonkers!

how many r's are there in the word strawrrberrry

Thought for 173 seconds

1 problem understanding

The task is to count the number of 'R's in the word 'strawrrberrry'. This requires careful examination of each letter in the given word.

2 data gathering

Let's break down the word into individual letters: s, t, r, a, w, r, r, b, e, r, r, r, y

3 analysis

We need to count the occurrences of the letter 'R' in this sequence. Let's go through each letter and keep a count.

4 calculation

Counting 'R's: 1. First 'R' (t=1), 2. Second 'R' (t=2), 3. Third 'R' (t=3), 4. Fourth 'R' (t=4), 5. Fifth 'R' (t=5). Total count: 5

5 verification

Let's double-check our count by going through the word again: strawrrberrry. Confirmed, there are indeed 5 'R's.

6 error checking

Ensuring we haven't missed any 'R's or counted any extra. The count remains at 5, which appears to be correct.

7 conclusion drawing

Based on our careful count and verification, we can confidently state that there are 5 'R's in the word 'strawrrberrry'.

8 implementation

The final response should be a simple statement: 'There are 5 R's in the word strawrrberrry.'

There are 5 R's in the word strawrrberrry.


I know that this is a typical test of an LLM's ability to reason, but I wonder how much time could be saved by teaching an LLM to recognise the types of problem that it's inherently bad at, bundling a Python interpreter, and asking it to write a short function to solve the problem. Are we not pushing a boulder uphill?
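For what it's worth, the short function the comment imagines would be trivial; a minimal sketch (the `count_letter` name and the case-folding choice are mine):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of `letter` in `word`."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))      # 3
print(count_letter("strawrrberrry", "r"))   # 6 -- the trick word above
```

Exact character-level counting like this is exactly the kind of task a tool call handles in microseconds while the model "thought for 173 seconds" and still got it wrong.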


> teaching an LLM how to recognise the type of problem that it's inherently bad at

Solving this is the actual hard part, and it is either adjacent to or even equivalent to solving the problem of LLMs hallucinating. ChatGPT already includes a Python interpreter tool, which it can use when the context indicates it's appropriate.


I suppose my question is pointing to another, which is: can one make an LLM that doesn't hallucinate? Isn't that problem inherent to the way that LLMs work? Obviously we can try to clean the data so there isn't any nonsense fed into it, but that'll only get you so far with a probabilistic, stochastic system. As an LLM once told me "Some experts argue that hallucination is an innate limitation of LLMs, akin to confabulation in humans, where false memories are created without the intention to deceive". I'm not sure if I believe that though.


For LLMs, no. But the explanation is wrong too: it has nothing to do with 'false memories' and everything to do with how LLMs work.

Here is the paper.

https://arxiv.org/abs/2401.11817

RAG and fine-tuning improve domain specificity and may reduce the problem to a level where you don't care, but it will always be there.

Clean data would help reduce the incidence, possibly to a level that is more usable, but it also doesn't remove the problem.

Considering next-token prediction as serial runs of a multi-tape TM, with the previous output as the input, can help.

Especially if you consider Microsoft's BitNet b1.58, which requires full precision for training but can reduce weights to ternary values (essentially just their signs) for inference.

Unfortunately, all the paths I have to explaining this require graduate-level complexity theory and/or differential geometry. Or you relive the Brouwer–Hilbert controversy by trying the logic path.


Does it still count correctly if you ask for variations like "strawerberry"?


I found the following transcript of the thought process to that question equally enlightening and funny:

Thought for 4 seconds

1 problem understanding

The task is to count the number of instances of the letter 'r' in the word 'strawerberry'. This requires identifying each occurrence of 'r' in the given word and keeping a count.

2 data gathering

The word to analyze is 'strawerberry'. I need to examine each character in this word.

3 hypothesis formation

Based on a quick glance, I hypothesize that there are 2 instances of 'r' in 'strawerberry'.

4 analysis

Let's break down the word into individual characters: s, t, r, a, w, e, r, b, e, r, r, y

5 calculation

Now, let's count the 'r's: 1) First 'r' at position 3, 2) Second 'r' at position 7, 3) Third 'r' at position 10. Total count: 3

6 verification

Double-checking the count: s(0), t(0), r(1), a(0), w(0), e(0), r(2), b(0), e(0), r(3), r(3), y(0). Confirmed: 3 instances of 'r'.

7 error checking

The initial hypothesis was incorrect. The actual count (3) differs from the initial guess (2). This highlights the importance of careful analysis rather than quick assumptions.

8 conclusion drawing

The letter 'r' appears 3 times in the word 'strawerberry'.

9 implementation

The final response should be the number 3, as this is the correct count of 'r' instances in 'strawerberry'.


I get:

>Alright, I need to figure out how many 'r's are in the word 'strawerberry'. Let's look at it carefully. I'll write it down: s-t-r-a-w-e-r-b-e-r-r-y. Now, I'll go through each letter one by one.

>Starting from the beginning: s, t, r – that's one 'r'. Then a, w, e, r – that's the second 'r'. Next, b, e, r – that's the third 'r'. And finally, another r before the y, making it the fourth 'r'. So, I count four 'r's in total.

>The word "strawberry" contains four 'r's.


Wow, so close. I wonder what went wrong here?


Tokenization, still. LLMs don't see individual letters. It's like using Google Translate to ask a German speaker the same question and concluding that Germans can't count letters.
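To make that concrete, here is a sketch with a hypothetical BPE-style split (real token boundaries depend on the tokenizer; these pieces are illustrative only):

```python
# Hypothetical subword split -- the model sees opaque token IDs for
# pieces like these, never the individual characters inside them.
tokens = ["str", "aw", "berry"]

# Counting letters requires character-level access, which we have in
# Python but the model does not:
word = "".join(tokens)
print(word, word.count("r"))  # strawberry 3
```

So "how many r's?" asks the model about units it never directly observes, which is why even otherwise strong models stumble on it.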


Does this mean there's no actual reasoning happening in these "reasoning" steps?


Well, the model is based on llama-8B, which is quite bad at reasoning. Reasoning (or things that look and quack like reasoning) is more the domain of 70B+ models, and some of the newer 7B models.

The model is doing well on many reasoning tasks; what they are doing is a massive step up from llama-8B. But it still makes some silly mistakes. I bet if you did the same finetuning procedure with qwen-7B or llama-70B as a starting point you would get a quite competent model.


Of course there’s no actual reasoning.


Is there something more than prediction going on? I'd like to understand why there would be studies like this: https://arxiv.org/pdf/2307.02477 if it's all just prediction. Are there some algorithms that are called "reasoning", that are somewhere between prediction and full-blown reasoning?


My guess is that the framing in that paper was chosen to be provocative. The authors surely understand that deep neural networks neither reason nor recite as we understand those terms in humans.

In any case, the conclusion was decidedly against reasoning.



