Whether you bake the behaviour in or wrap it in an external loop, you need to train/tune for the expected behaviour. Generic models can do chain of thought if asked to, but they will be worse at it than a specialised one.
Beam search just traverses different paths and assigns each path a probability of being correct. The paths with the higher probabilities are kept and the ones with lower probabilities are pruned until the search terminates with an "answer". The marketing department calls it "reasoning" and "test-time compute" because the average consumer does not care whether it's beam search or something else.
Your link seems to do a good job of explaining beam search but it's a classic algorithm in state space exploration so most books on search algorithms and discrete optimization will have a section about it.¹
An algorithm that searches for the highest probability answer is not "reasoning"? "Search" has been a fundamental building block of GOFAI since the beginning. How do you define reasoning? Can you justify it being different from the last 70 years of thought on the topic?
Since you asked, I define reasoning as Cambridge does:
Reasoning: "the process of thinking about something in order to make a decision"

Thinking: "the activity of using your mind to consider something"

Mind: "the part of a person that makes it possible for him or her to think, feel emotions, and understand things"
I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"
I also think that Cambridge's definition of Mind is incomplete: it lacks the creativity part, along with cognition and emotions. But that's a vastly different topic.
>I conclude that "An algorithm that searches for the highest probability answer" is not described by "reasoning"
I recall reading about a theory in neuroscience where many thoughts (neural circuits) fire simultaneously and then one "wins" while the others get suppressed. The closest thing I can find right now is Global Workspace Theory.
The Chinese Room is a thought experiment: a room that supposedly contains a "Chinese speaker", but when given a text to 'understand', the occupant actually just looks the text up in a huge rulebook until a matching response is found, then outputs that response as the reply
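The lookup-table picture can be sketched in a few lines (the rulebook entries and the fallback reply are made up for illustration; Searle's actual thought experiment uses symbol-manipulation rules, not a literal dictionary):

```python
# Toy "Chinese Room": the operator understands nothing; every reply is
# produced purely by looking the input up in a fixed rulebook.
RULEBOOK = {
    "你好": "你好！",            # hypothetical entries the operator cannot read
    "你会说中文吗？": "会。",
}

def room_reply(text: str) -> str:
    """Return the scripted response, or a stock fallback if nothing matches."""
    return RULEBOOK.get(text, "请再说一遍。")
```

The point of the experiment is that such a system can pass for a speaker from the outside while nothing inside it understands anything.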
I looked into it, this "beam search" is nothing but a bit of puffed-up nomenclature, not unlike the shock and awe of a language such as Java introducing synonyms for common terms for no apparent reason, or the intimidating name of "Bonferroni multiple test correction", which is just dividing the significance threshold by n.
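For what it's worth, the correction really is that small: divide the significance level α by the number of tests. A minimal sketch (the p-values here are invented for illustration):

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject the null hypothesis for each test whose p-value
    falls below the Bonferroni-corrected threshold alpha / n."""
    n = len(p_values)
    threshold = alpha / n
    return [p < threshold for p in p_values]

# Three tests at the usual 0.05 level -> per-test threshold 0.05/3 ≈ 0.0167
```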
"Beam search" is breadth-first search. Instead of taking all the child nodes at a layer, it takes the top <n> according to some heuristic. But "top n" wasn't enough for whoever cooked up that trivial algorithm, so instead it's "beam width". It probably has more complexities in AI where that particular heuristic becomes more mathematical and complex, as heuristics tend to do.
The known upper bound on what transformers can compute on the fly is a complexity class called DLOGTIME-uniform TC^0.
There is a lot to unpack there, but if you take FO (first-order logic) as being closed under conjunction (∧), negation (¬) and universal quantification (∀), you will find that DLOGTIME-uniform TC^0 is equal to FO extended with majority quantifiers (FOM).
So be careful about that distinction.
To help break the above down:
DLOGTIME = decidable by a random-access Turing machine in deterministic logarithmic time
uniform = a single algorithm describes the circuit for every input size, rather than allowing an arbitrary circuit family
TC^0 = constant-depth, polynomial-size threshold circuits
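To make "threshold circuit" concrete: each gate fires when at least some number of its Boolean inputs are 1, and MAJORITY is the canonical example. A toy simulation of a single gate (not the formal circuit model, just the gate behaviour):

```python
def threshold_gate(inputs, k):
    """Fire (return 1) iff at least k of the Boolean inputs are 1."""
    return int(sum(inputs) >= k)

def majority(inputs):
    """MAJORITY: a threshold gate whose threshold is just over half the fan-in."""
    return threshold_gate(inputs, len(inputs) // 2 + 1)
```

AND and OR are the special cases k = n and k = 1, which is why threshold circuits subsume AC^0; the unbounded fan-in majority gate is what lifts them to TC^0.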
Even NP == SO-E, the second-order queries where the second-order quantifiers are only existentials.
DLOGTIME-uniform TC^0 is a WAY smaller class than most people realize, but anything that is an algorithm or a program basically is logic, with P being FO plus a least-fixed-point operator (and NL being FO plus transitive closure), or half a dozen other known mappings.
Transformers can figure out syntax, but if you dig into that DLOGTIME part, you will see that semantic correctness isn't really an option... thus the need to leverage the pattern matching and pattern finding of pre-training as much as possible.
Thanks. If I'm reading this right, the limiting factor on the intelligence of current LLMs is not the network size, nor training data (size/quality) but rather the architecture? Do we know of a better one for complex computations / "reasoning"?
Isn't it an LLM with an algo wrapper?