When Minsky and Papert showed that the perceptron couldn't learn XOR, it contributed to wiping the neural network off the map for decades.
It seems that no amount of demonstrating fundamental flaws in these systems, flaws that all the new and improved "reasoning" should have solved, works anymore. People are willing to call these "trick questions", as if they were disingenuous, even when they are discovered in the wild through ordinary interactions.
It doesn't work this time because there are plenty of models, including GPT-5 Thinking, that can handle this correctly, so it is clear this isn't a systemic issue that can't be trained out of them.
It is clear it is not, given we have examples of models that handle these cases.
I don't even know what you mean with "architecturally all checks are implemented and mandated". It suggests you may think these models work very differently to how they actually work.
The suggestions come from the failures, not from the success stories.
> what you mean with "architecturally all checks are implemented and mandated"
That NN models have an explicit module which works as a conscious mind and does lucid ostensive reasoning ("pointing at things"), reliably respected in their conclusions. That module must be stress-tested and proven reliable. Success stories that are only result-based are not enough.
> you may think these models work very differently to how they actually work
> The suggestions come from the failures, not from the success stories.
That thinking is flawed. The successes conclusively prove that the issue isn't systemic because there is a solution.
> That NN models have an explicit module which works as a conscious mind and does lucid ostensive reasoning ("pointing at things"), reliably respected in their conclusions.
Well, this isn't how LLMs work.
> That module must be stress-tested and proven reliable. Success stories that are only result-based are not enough.
Humans aren't reliable. You're setting the bar at a level well beyond what is necessary, and almost certainly beyond what is possible.
> I am interested in how they should work.
We don't know how they should work, because we don't know what the optimal organisation is.
> The successes ... prove that the issue isn't systemic because there is a solution
The failures prove that the user may well not encounter said solution. The solution will have to be explicit, because we need to know if (practically) and how (scientifically) it works. And said solution will have to convincingly work on all branches of the general problem, of which "not really counting" is just a hint - "not properly handling mental objects" is what we fear, the «suggestion of a systemic issue» I mentioned.
> Well, this isn't how LLMs work
Yes, and that is an issue, because using an implementation of deliriousness is an issue. They must be fixed - we need the real thing.
> Humans aren't reliable. You're setting the bar at a level well beyond what is necessary
The flaws found in humans have proven nothing from the start ("My cousin speaks just like Eliza" // "Well don't ask her then"; "The Nobel prize failed" // "And it still remains a better consultant than others" etc.).
We implement automated versions of qualities only incidentally found in humans - that's just because tools are created to enhance the problem-solving practices we already tackled with what we had.
And in this case (LLMs), there are qualities found in nature that are missing and must be implemented, so that our tools are not implementations of psychiatric cases: foremost here, the conscious (as opposed to the intuitive unconscious).
> and almost certainly beyond what is possible
It's necessary. And I do not see what justifies doubts about the possibility (consider that we implemented the symbolic well before NNs, or that in early NNs the problem of implementing deterministic logic was crucial...). We are dealing with black boxes; we plainly have to understand them as required and perfect (complete) them.
> what the optimal organisation is
There are clear hints for that. The absence of a "complete" theory of mind is not a stopper - features to be implemented are clear to us.
> It suggests you may think these models work very differently to how they actually work.
It suggests to me the opposite: that he thinks there can be no solution that doesn't involve externally policing the system (which it quite clearly needs to solve other problems with trusting the output).
Given that we have a solution that doesn't require "externally policing the system", since newer/bigger models handle it, that is clearly not the case.
In this case, tokenization is a less effective counterargument. If it were one-shot, maybe, but the OP asked GPT-5 several times, with different formatting of blueberry (and therefore different tokens, including single-character tokens), and it still asserted there are 3 b’s.
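For reference, at the character level the formatting makes no difference to the answer. A quick sketch in Python; the variants below are made up for illustration and are not the OP's actual prompts, and `count_letter_normalized` is just a name picked for this example:

```python
def count_letter_normalized(text: str, letter: str) -> int:
    """Count a letter case-insensitively, ignoring spaces, hyphens and other non-letters."""
    letters = [ch.lower() for ch in text if ch.isalpha()]
    return letters.count(letter.lower())


# Hypothetical formatting variants, not the OP's actual prompts.
variants = ["blueberry", "BLUEBERRY", "b l u e b e r r y", "B-L-U-E-B-E-R-R-Y"]
print([count_letter_normalized(v, "b") for v in variants])  # [2, 2, 2, 2]
```

So whatever the tokens end up being, the underlying string always has 2 b's.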
I don't think it's just tokenization. Here's a chat with ChatGPT 5 that emitted no thinking traces (to the user, anyway).
> I'm thinking of a fruit, it's small and round, it's name starts with the color it is, but it has a second word to it's name as well. Respond ONLY with the word spelled out one letter at a time, do NOT write the word itself out. Don't even THINK about the word or anything else. Just go straight to spelling.
B L U E B E R R Y
> How many B's in that word? Again, NO THINKING and just say the answer (just a number).
3
However, if I prompt it with this instead, it gets it right.
> How many B's in the following word? NO THINKING. Just answer with a number and nothing else: B L U E B E R R Y
What does the prompt "no thinking" imply to an LLM?
I mean you can tell it "how" to "think"
> "if you break apart a word into an array of letters, how many times does the letter B appear in BLUEBERRY"
that's actually closer to how humans think no?
The problem lies in how the LLM breaks a problem into tasks: it should not be applying a dictionary to blueberry, seeing blue-berry, and splitting that into a two-part problem to rejoin later
But that's how it's meant to deal with HUGE tasks, so when applied to tiny tasks, it breaks
And unless I am very mistaken, it's not even the breaking apart into tasks that's the real problem, it's the re-assembly of the results
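For what it's worth, the "array of letters" approach from the prompt above is a few lines in ordinary code. A minimal sketch in Python; the function names and the blue/berry split are made up for illustration, and this says nothing about what actually happens inside a model:

```python
def count_letter(word: str, letter: str) -> int:
    """Break the word into a list of letters and count case-insensitive matches."""
    letters = list(word.upper())
    return sum(1 for ch in letters if ch == letter.upper())


def count_letter_in_parts(parts: list[str], letter: str) -> int:
    """Count per part, then re-assemble by summing the partial counts."""
    return sum(count_letter(part, letter) for part in parts)


print(count_letter("BLUEBERRY", "B"))                  # 2
print(count_letter_in_parts(["blue", "berry"], "b"))  # 2
```

Splitting and rejoining gives the same 2 as the direct count, as long as the partial counts are summed correctly, which would fit the idea that the re-assembly step is where things go wrong.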
It's just the only way I know to get GPT-5 to not emit any thinking traces into its context, or at least not any of the user-facing ones.
With GPT-4.1 you don't have to include that part and get the same result, but that's only available via the API now AFAIK. I just want to see it spell the word without having the word in its context for it to work from.
I don’t find the explanation about tokenization to be very compelling.
I don’t see any particular reason the LLM shouldn’t be able to extract the implications about spelling just because its tokens are “straw” and “berry”.
Frankly I think that’s probably misleading. Ultimately the problem is that the LLM doesn’t do meta analysis of the text itself. That problem probably still exists in various forms even with character-level tokenization. Best case it manages to go down a reasoning chain of explicit string analysis.
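To make "explicit string analysis" concrete, here is a rough sketch in Python of what such a chain amounts to: spell out each token, flatten, count. The split into "straw" and "berry" is assumed for illustration and is not necessarily what any real tokenizer produces; `spell_out` and `count_in_tokens` are hypothetical names.

```python
def spell_out(tokens: list[str]) -> list[str]:
    """Expand a sequence of tokens into a flat list of characters."""
    return [ch for token in tokens for ch in token]


def count_in_tokens(tokens: list[str], letter: str) -> int:
    """Count a letter by working on the spelled-out characters, not the tokens."""
    return spell_out(tokens).count(letter)


tokens = ["straw", "berry"]          # assumed split, for illustration only
print(spell_out(tokens))             # ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(count_in_tokens(tokens, "r"))  # 3
```

Nothing about the token boundary loses information here; the count only needs the spelling of each piece.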
is a great way to teach people how LLMs work (and don't work)
https://techcrunch.com/2024/08/27/why-ai-cant-spell-strawber...
https://arbisoft.com/blogs/why-ll-ms-can-t-count-the-r-s-in-...
https://www.runpod.io/blog/llm-tokenization-limitations