Large Language Models as Markov Chains (arxiv.org)
75 points by Anon84 on Nov 30, 2024 | 54 comments


What's not apparent is how the Markovian framework accounts for the apparent long-range dependencies and global coherence we observe in LLM outputs, particularly in tasks requiring multi-step reasoning or maintaining context across many tokens. While the analysis elegantly explains local statistical patterns, it seems to conflict with transformers' ability to maintain attention across arbitrary distances... or am I missing something?


Transformers (and other feedforward autoregressive models) are obviously Markovian. Here it's shown that these are equivalent to a Markov Chain with a transition matrix.

There's in principle no reason why such a chain can't have arbitrary (but not infinite) range dependencies. The major difference is that these Markov chains can be too huge to compute and don't have a feasible training algorithm.
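To make this concrete, here is a minimal sketch (the vocabulary, window size, and toy transition rule are all invented for illustration) of viewing an autoregressive model as a Markov chain whose state is the entire K-token context window:

```python
import random

K = 3  # context window length (assumption for illustration)
VOCAB = ["a", "b", "c"]

def next_token_dist(state):
    """Stand-in for an LLM forward pass: maps a K-token state to a
    probability distribution over the vocabulary. It is memoryless:
    it depends only on `state`, which is exactly the Markov property."""
    # Toy rule: weight each token by how often it appears in the window.
    weights = [1.0 + state.count(t) for t in VOCAB]
    total = sum(weights)
    return [w / total for w in weights]

def step(state, rng=random):
    """One Markov transition: sample a token, slide the window."""
    probs = next_token_dist(state)
    token = rng.choices(VOCAB, weights=probs, k=1)[0]
    return state[1:] + (token,)  # new state drops the oldest token

state = ("a", "b", "c")
for _ in range(5):
    state = step(state)
```

The transition depends on the whole window, so dependencies up to K tokens back are captured; the price is a state space of size T^K.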


That's not especially true. Markov models only care about the last N words of context, and in practice that forces N to be small. Transformers track more/larger state using self-attention, a more complex mathematical mechanism, which integrates information across the entire context window in a richer way.


Transformers only care about N words/tokens of context. That was the primary reason they became the dominant language model architecture.

A system being a Markov chain doesn't say anything more about the function mapping inputs to states than that the function has no memory and has access to the full state that affects the transition probabilities to the next state.


Maybe this is at cross-purposes: Functions don't have memory. Nothing real ever "is" a Markov chain. It's simply a (slightly unusual) modelling choice which you can use whenever you've listed all possible inputs that can affect a transition probability - once you've done that, you can apply it to pretty much anything.


> Nothing real ever "is" a Markov chain.

What do you mean? Quantum states over time are very much Markov chains as well.

> It's simply a (slightly unusual) modelling choice which you can use whenever you've listed all possible inputs that can affect a transition probability

That has nothing to do with Markov chains; do you mean a specific implementation? A Markov chain is a probability-theory concept and is independent of any implementation.


In many cases the Markov assumption is made to simplify things even though it's not realistic. But nevertheless it does describe what the model itself does even if in the real environment it doesn't act as a real Markov process (e.g. doesn't necessarily have a stationary distribution).

It applies here too when the model interacts with humans, as the LLM's state doesn't include the human.

While Markov chains/processes are a powerful and general abstraction, the assumptions are quite strict. E.g. even the simplest recurrent systems aren't Markov, and this leads to things like having to assume an exponential distribution of event durations in time-series contexts, which is quite limiting.

In language models it draws a clear demarcation between recurrent and feedforward neural networks and brings some crucial understanding into the benefits and limitations of e.g. transformers vs. recurrent models.


They use O(T^K) states where T is the size of the "vocabulary" (which would in fact be tokens in the case of an LLM) and K is the size of the context window. In practice, this would be an enormous number of states, and would be able to account for all possible long-range dependencies. This bit is trivial.

The paper seems to be more about the stationary distributions of such Markov chains.


> In practice, this would be an enormous number of states

The paper mentions this explicitly:

For GPT-3 (Brown et al., 2020), this represents 5 × 10^11 training tokens, which pales in comparison with the number of non-zero elements in Qf, given by T^(K+1) ≈ 10^9632.
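The quoted figure is easy to sanity-check in log space (T and K below are the GPT-3 vocabulary size and context length as cited in this thread):

```python
import math

T, K = 50257, 2048  # GPT-3 token vocabulary and context window

# Entries in the full transition matrix Q_f: one row per possible
# context (T^K states), one column per next token, i.e. T^(K+1) entries.
log10_entries = (K + 1) * math.log10(T)
print(f"T^(K+1) ~= 10^{int(log10_entries)}")  # ~10^9632, matching the paper
```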


I've found that it's entirely too easy to get inconsistent (i.e. self-conflicting) output in the same response from the likes of GPT-4o when you require reasoning and logical deduction on topics completely or mostly alien to it and not found in the massive input corpus. So I'm not sure this is the difference between a Markov chain and a transformer LLM.


This paper states a completely obvious tautology, which is that if you limit the context length, an LLM has a finite state size!


> This paper states a completely obvious tautology, which is that if you limit the context length, an LLM has a finite state size!

It has to be stated, since many people, even here on HN, believe that LLMs aren't Markov chains because they misunderstand basics like this.


LLMs are nothing like the Markov chains I've come across or worked with. I've never used a Markov chain trained with anywhere near the number of states you'd need.

This feels like saying "it's just if statements". So yeah, but also that's a terrible mental model for how to use them.


Any system with a memoryless, finite transition rule (the Markov property) and full observability of the state is a Markov chain. In language models the "traditional" Markov chain is the very simple small-context table lookup, but in many other fields they can be a lot more complicated, e.g. Markov Chain Monte Carlo and many statistical-mechanics applications.

It's a mathematically well-understood and very powerful abstraction, which is lost if Markov chains are thought of only as the simple toy language model.
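As an illustration of the point that Markov chains go far beyond toy language models, here is a minimal Metropolis MCMC sketch: a Markov chain constructed so that its stationary distribution matches a target known only up to a constant. The target weights are invented for illustration:

```python
import random

target = {0: 1.0, 1: 2.0, 2: 4.0}  # unnormalised target weights (assumption)

def metropolis(steps, rng=random):
    """Metropolis Markov chain: propose uniformly (a symmetric proposal),
    accept with probability min(1, target(proposal)/target(current))."""
    x = 0
    counts = {s: 0 for s in target}
    for _ in range(steps):
        proposal = rng.choice(list(target))
        if rng.random() < min(1.0, target[proposal] / target[x]):
            x = proposal  # accept the move
        counts[x] += 1
    return counts

counts = metropolis(100_000)
# Empirical visit frequencies approach 1/7, 2/7, 4/7 as steps grow.
```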


What's the Markovian interpretation of the notion of embeddings in a high-dimensional space?

Or of attention?

The term is being tortured beyond endurance by being forced to describe what LLMs do.


> The term is being tortured beyond endurance by being forced to describe what LLMs do.

No it isn't; this is exactly the kind of process that Markov chains describe. They describe a stateless probability function that you chain over and over to produce a sequence of results, exactly like what an LLM does: predicting the next token based on the previous tokens, repeatedly, to produce text.

There is no bending of any concepts here; anyone who has studied a higher-level probability course will instantly recognize an LLM as a Markov chain. It is common for Markov chains to have an infinite number of states, like a real-valued position, with random moves from there. You can model particles in a gas like that.
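For contrast, the "toy" end of the spectrum: a minimal word-bigram Markov chain (corpus invented for illustration). An LLM differs only in that its state is a K-token window and its transition function is a neural network rather than a lookup of empirical counts:

```python
import random
from collections import defaultdict

corpus = "the cat sat on the mat the cat ran".split()

# Empirical next-word lists: repeated successors encode the probabilities.
transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)

def generate(start, n, rng=random):
    """Chain the transition function n times; each step depends
    only on the last word, i.e. the Markov property."""
    out = [start]
    for _ in range(n):
        choices = transitions.get(out[-1])
        if not choices:
            break
        out.append(rng.choice(choices))
    return " ".join(out)
```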


I'm not disagreeing that it's correct, I'm disagreeing that it's meaningful. Calling an LLM a "Markov model" reveals an agenda rather than an insight.


> Calling an LLM a "Markov model" reveals an agenda rather than an insight.

What agenda? Knowing that LLMs are Markov chains helps me understand them and makes me better at using them: every word is calculated probabilistically based only on the previous N words, and the fastest way to describe that is to say it is a Markov chain. It also tells me it is hard for them to coordinate their early words with things that will happen later, all of that from just knowing it is a Markov chain.

I feel the knee-jerk rejection of this statement is more of an agenda. Why do you feel it is so wrong to say it is a Markov chain? Are you sure there is no agenda behind those feelings?


The agenda may well be "pure mathematics".

I don't think "Markov chain" is a useful description despite agreeing it's an accurate one, simply because I have a preference for applied maths over pure maths and a full transition table version would be obscenely large.

In the same vein, it's also technically accurate to say that a perceptron network with one hidden layer can approximate any function: technically yes, but not in any practical sense.

But if you're working in the pure maths domain, knowing that it can in principle be one, means you can apply any tools that only work on Markov chains. Given what else I've seen mathematicians do, I won't be surprised if pure maths researchers can get useful results even though the transition matrix is literally too large to fit into our universe without collapsing it into a black hole — it would hardly be the first time I've even seen mathematicians talking about quantities that large.


The theory of Markov chains (or Markov processes in general) has nothing to say about what happens inside the function that maps observations to predictions of the next state, as long as the function is stateless/memoryless.

It is a different level of analysis than neural network architecture and covers a very wide range of model types. The Markov chain structure of a model allows for some understanding of the behavior of models fulfilling its assumptions, e.g. the finite range of dependencies, the existence of a stationary distribution, and, crucially for transformer-type architectures, that they can be trained in a purely parallel manner.


@CamperBob2

> The problem I have is that anything deterministic can be described as Markovian.

Markov processes aren't even necessarily deterministic. See Hidden Markov models (https://en.wikipedia.org/wiki/Hidden_Markov_model) and POMDPs (https://en.wikipedia.org/wiki/Partially_observable_Markov_de...).


The problem I have is that anything deterministic can be described as Markovian.

If an algorithm employing 8 billion parameters addressed by a context that is itself cross-linked with a key-value store qualifies as "memoryless" enough to be a Markov model, does the term have any meaning at all? You could argue that it's memoryless because the embeddings aren't modified on the fly based on the user's input, I suppose.

My point is that describing a toy Markov text generator with the same terms that you apply to GPT3.5 is, even if technically accurate, no more meaningful than using the term "Turing machine" to describe a C-64 and a Cray.


It's memoryless because it's a pure function. All feedforward neural networks are memoryless.

This is as opposed to an RNN, which has an internal state.


While this is absolutely correct when the RNN has ℝeal-valued internal state, in concrete implementations the internal state of an RNN has some number of bits, and one can draw a Markov graph with 2^n states to represent all possible states of n bits in some part of the system.

Likewise the context window of an LLM can be considered a state with some number of bits. It's just more explicit in an LLM's architecture vs. a RNN's runtime that the number of bits is finite.


> My point is that describing a toy Markov text generator with the same terms that you apply to GPT3.5 is, even if technically accurate, no more meaningful than using the term "Turing machine" to describe a C-64 and a Cray.

What if we used the term "computer"?


> What if we used the term "computer"?

And anthropomorphise the machines? ;)

https://en.wikipedia.org/wiki/Computer_(occupation)


It’s not a particularly useful mental model, but it does drive home a critical point some people have trouble with, namely that LLMs don’t learn when being used.


Current AIs don't learn like us.

But the context length is around what a human goes through in a day, so it can feel like it, and they can use tools such as databases to make a direct record of "important" things, and fine-tuning on sessions is also already possible.

I'm going to ignore "proper" continuous learning until that actually gets into the big-name models, because although there's plenty of "first 90%" tech demos, the Tesla FSD is also at that point and yet the cars today still come with steering wheels. Meanwhile, here's the research SOTA: https://github.com/Wang-ML-Lab/llm-continual-learning-survey

Ironically, continuously updating a Markov chain is easy, given how small most of them are in practice. But a Markov chain large enough to act like an LLM would collapse into a black hole 7x10^32 times larger than the universe*, so that's kinda hard to update.

* assuming the algebra was correct: http://www.wolframalpha.com/input/?i=sqrt%2812%2A%28%2850000...

Edit: I made at least two errors with that algebra, one of which was a typo; I now think it's 4.2x10^4750 universe diameters…

…unless there's another error I missed: http://www.wolframalpha.com/input/?i=sqrt%2812%2A%28%2850000...


> But a Markov chain large enough to act like an LLM would collapse into a black hole

You mean a full-state-table Markov chain? Markov chains don't need to have all states in a table; the transition rule just needs to be a function. A table is one implementation of a function, but anything will do.


Indeed.

Although my understanding is that under quantum mechanics it's necessary to set the probability of all transitions between states to a number larger than zero, I also understand that for practical purposes the probabilities for the overwhelming majority of transitions are close enough to zero as to require fancy notation (at the very least, nested exponentiation) to express quite how unlikely they are.


While it is obviously trivially (in the mathematical sense of "trivially") possible to make a Markov chain that produces the same f(input context window) => {output distribution}, that doesn't make it useful to describe LLMs as a kind of Markov chain.

After all, in the same sense I can also make a Markov chain that reproduces the behaviour of any human, or even any Turing machine with finite tape — it's just that a computer with 1 GB of memory (all kinds including non-volatile) has 2^(8·2^30) states.


So I'm confused.

In my college classes, sometimes we'd get problems involving "construct a Markov chain to model this process". Doing the naive, obvious solution wouldn't make the probabilities work because we'd "forget" crucial information, so we'd just add extra bits to the state (e.g. "flag1 flag2 flag3 ..."), which would make the number of nodes exponentially larger.

But the infinite information part of markov chains is really helpful sometimes too, e.g. the position of your random walk wouldn't be encoded in your "state" (which could be infinite), it'd just be returned as a value (if you were trying to compute the distance from the origin or something).

Are you guys just arguing over "at what point does shoving a billion bits into my state make it not really a 'Markov' chain like you'd traditionally think about it"? Basically the formal definition vs. what we intuitively think of as the 'point' of Markov processes and why the Markov observation is important?


> Are you guys just arguing over "at what point does shoving a billion bits into my state make it not really a 'Markov' chain like you'd traditionally think about it"? Basically the formal definition vs. what we intuitively think of as the 'point' of Markov processes and why the Markov observation is important?

That's an accurate description of my perception of the conversation, I can't speak for the others of course.


I've seen this nonsensical equivalence several times now so I'm going to make an example out of you to prove a point. Define the state space for the human Markov chain.


> Define the state space for the human Markov chain.

At most the Bekenstein bound for the mass of a human brain.

Is that big? Yes, too huge to bother calculating. That's why I'm using it as an example to say "it's a Markov chain" is not helpful.

But it is finite, and that means it's mathematically equivalent.

And that's before the thing about transition matrices being (in general if not in particular) square, from any one state to any other.

For fun, here's the Bekenstein bound for the equivalent transition matrix for the much smaller GPT-3 with 2048 context length, measured in (amongst other things) multiples of the size of the universe: http://www.wolframalpha.com/input/?i=sqrt%2812%2A%28%2850000...

(Assuming I didn't fluff the algebra).

Edit: I fluffed the algebra, see if you can spot the typo. It's much, much bigger.


What are the states? I didn't say give me a count of the states. I said: give me the algorithm for determining whether any given state of the universe corresponds to whatever human you are reducing to a Markov chain. I don't care whether it's practical, because you seem convinced this algorithm exists even though you have never gone through the trouble of actually formalizing it. So I'm doing you a favor: you will either learn about your ignorance, or you will make your argument algorithmically formal and actually prove that people are equivalent to Markov chains.


Seriously?

Of all the things I've heard people being dismissive of, you're the first I've heard suggest it might possibly be more than trivial to create a map from physical objects to some enumeration.

It's not like simulated physics is somehow a novel field I've had to invent in my head, it's a field that's been performing such mappings, even for quantum systems (on smaller scales than a brain, but you yourself say "I don't care if it's practical or not") for decades already.

In extremis, the Bekenstein bound, you've got a surface which — at a resolution of (IIRC) 2π bits per square Planck area — contains all the information about the interior. There's a lot of ways to map the surface of a sphere into Planck areas, pick any you want. That ordering itself defines a state down to the quantum level.

There's in the order of 10^17 of those Planck areas if a human brain was compressed into a black hole, so that's the maximum information content of any given possible arrangement of matter and energy with the mass of a human brain.

The only thing I'm struggling with right now is why you might be so confident that this is difficult that you're even thinking to try and make an example out of me?


That's a lot of words and yet I have not seen an actual proof of equivalence. Do you even know the definition of a Markov chain?


> Do you even know the definition of a Markov chain?

Dude, I've written several Markov chains, starting aged something like six while still learning to read. (Family had a Commodore 64, one of the books it came with had one as a coding example).

Markov chains are stochastic processes where the transitions between possible nodes have probabilities depending only on the current node, i.e. history is irrelevant.

So, right back'atcha, if you can't see how what I wrote is a proof of equivalence — you've got a way to map reality (of a human brain) onto a number between 0 and ~2^(10^17)*.

Or perhaps, despite your previous question being answered, your actual issue is that you were unaware that the laws of quantum mechanics give you a stochastic process where the transitions between possible states have probabilities depending only on the current state?

* exponent times whatever the constant of proportionality is


Great, we're making progress. Give me the dimension of the Hilbert space, the wave function, and the unitary operator that corresponds to your daily activities in terms of a Markov chain. Take all the time you want to figure it out.


If you accept such things exist, my point is already made.

∃, ∎


It's just a formal game, but you still haven't managed to write down the state space nor specify the transition probabilities for a single atom, let alone a person.


> It's just a formal game, but you still haven't managed to write down the state space nor specify the transition probabilities for a single atom, let alone a person.

False.

I've given you the state space: the one defined by k·10^17 bits, i.e. roughly 2^(k·10^17) states. I don't know why you have a hard time with this concept; it doesn't seem particularly challenging to me.

I've also said, in what I thought was quite clear language, that writing down the transition probabilities for even a system significantly simpler than a human brain would require so much information that by the Bekenstein bound it would require a universe substantially larger than this one to avoid becoming a black hole.

If you're not going to accept that the very specific things you asked for, things which you seem to accept exist, are already sufficient proof that quantum mechanics, and by extension all of reality including humans, is in the set of things defined by Markov chains (short of my literally destroying the universe by writing the function down in that fashion), then we're done.

Or are you shifting the goal posts when you wrote this?:

> and the unitary operator that corresponds to your daily activities in terms of a Markov chain

Given that what actually fits into a comment box is either a super-high-level representation, with transitions at a macro level like "sleep" -> "get up" -> "go to work" etc., or a super-low-level representation like iħ ∂/∂t |ψ⟩ = Ĥ|ψ⟩.


So do you realize the absurdity of your original statement or do you want to keep going?


Only in the sense that anything is a Markov chain with some sufficiently large state size. It tells you nothing about the behavior.


That's only true for Markov chains with an (effectively) infinite state space. If the state space is finite, such as with an LLM with a finite set of tokens and a finite context length, it's trivial to tell the difference between the behavior of a specific Markov chain and a more sophisticated model. Once you have a specific Markov chain, the state space is fixed, and you can just ask something that requires a larger state space.

The same basic idea can be found everywhere in mathematics and CS. Once you have chosen the value for your parameter, I can choose the value for my parameter to guarantee the desired outcome.


Not everything is a Markov chain. Markov chains are distinguished by probabilistic transitions between states, just like LLMs.


LLM transitions are pseudoprobabilistic.


No, an LLM just returns a distribution over next states; you can use a genuinely random function to make the transition if you want. The randomness has nothing to do with the LLM itself; the pseudorandom function is just an optimization, since you need special hardware to get truly random numbers.
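A minimal sketch of this separation (the logits are invented for illustration): the model side produces only a distribution, and the transition is made by an external randomness source, here the OS-backed `secrets.SystemRandom`:

```python
import math
import secrets

def softmax(logits):
    """Turn raw model scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample(dist, rng=secrets.SystemRandom()):
    """Pick an index from `dist` via inverse-CDF sampling, using a
    randomness source supplied from outside the 'model'."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(dist):
        acc += p
        if r < acc:
            return i
    return len(dist) - 1

logits = [2.0, 1.0, 0.1]  # stand-in for a model's output (assumption)
token = sample(softmax(logits))
```

Swapping `secrets.SystemRandom` for a seeded PRNG changes nothing about the model itself, which is the point above.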


It's because they're speaking colloquially instead of pedantically. E.g. "an LLM is just a bag of bytes" is pedantically true, but misses enough that I'd expect to get corrections :)


The abstract mentions "stationary distributions". So they try to derive something new from that borderline obvious equivalence. The Google search engine originally used the stationary distribution to rank the importance of search results.
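A hedged sketch of what a stationary distribution is (the 3-state transition matrix is invented for illustration): repeatedly applying the transition matrix to any starting distribution converges, for an irreducible aperiodic chain, to a fixed point π with π = πP. PageRank computed exactly this object for the web-link chain:

```python
# Row-stochastic transition matrix: P[i][j] = probability of i -> j.
P = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.1, 0.3, 0.6],
]

pi = [1.0, 0.0, 0.0]  # any starting distribution works
for _ in range(1000):
    # One step of the chain: pi <- pi @ P (power iteration).
    pi = [sum(pi[i] * P[i][j] for i in range(3)) for j in range(3)]

# pi now approximates the stationary distribution: applying P again
# leaves it (numerically) unchanged.
```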


I'm not sure how this distinguishes the process from a table of all possible input context windows (and RNG state, if you want temperature) with their desired outputs.

This has been argued as a theoretical (but larger-than-the-universe) model in philosophy for long enough that I wouldn't be surprised if it predates the notion of creating a machine intelligence.

Without any mechanism for generalisation, it can be argued that it is simply a record of an intelligence, with the actual mind being whatever created the table.

I guess you can think of Markov chains as a form of that table, shrunk down to a hash table keyed on the input, storing only a set of the most probable outputs for any collisions.

With the right semantic hashing you could consider it analogous to an LLM, but I feel there's a lot leaning on the term 'semantic hashing', because that's where the magic is in the LLM.
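A minimal sketch of the "table shrunk to a hash table" idea (contexts and probabilities invented for illustration):

```python
# Exact-match lookup from full context windows to next-token distributions.
table = {
    ("the", "cat"): {"sat": 0.6, "ran": 0.4},
    ("cat", "sat"): {"on": 1.0},
}

def lookup(context):
    """Return the stored distribution for a context window, if any.
    A real LLM replaces this exact-match lookup with a learned function
    that generalises to unseen contexts; that generalisation is the part
    no finite table captures."""
    return table.get(tuple(context))
```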


Proposal: Lower the resolution of the context window for distant historical tokens. Summarize most important concepts, drop out the rest. Do not try to assess every single token.


And all real-world computers are also just finite state machines. But we obviously model computers as Turing machines for a reason.




