How Transformers Think

by Wizori

Making videos related to most useful AI stuff.

www.builderthinking.com

How Transformers Think — full transcript

Why Language Was Hard

The viewer will understand why older AI struggled with distant context and how transformers solve that problem with attention.

How Transformers Think starts with a simple shift: attention lets each word look across the whole sequence, instead of losing track in the distance. By the end, you'll know why old models fade, how attention links context, and what makes transformers work.

Before transformers, language models often read a sentence one step at a time, in a way that made nearby words matter more than distant ones. So if the key clue showed up early and the answer came later, the model could lose the thread before it reached the end. That was the core problem: the system was processing text, but it was not very good at keeping the right relationships alive across distance. You can picture the failure in action when a pronoun, a negation, or a named entity matters several words later, and the model still leans on the wrong local cue.

So the transformer changes the flow right there. Instead of letting nearness decide how much a word counts, it lets the model ask, at each step: which of the other words matter most to the one I am reading now? That is attention in motion. Inside the layer, each token can look at the others and assign stronger or weaker focus. A word that refers to a subject can pull in that subject; a verb can pull in its object; a negation can stay visible when the meaning depends on it. The model is not memorizing a sentence as one blur. It is building a set of relevance scores as it goes.

And that is why attention is the trick. It gives the network a way to route information where it is needed, instead of forcing every piece of text through the same narrow path. If you had to identify the key components here, you would point to the tokens, the attention scores, and the updated representations that come out of the layer.

So if I ask you what attention does in one sentence, the answer is simple: it lets the model decide which words should influence the current word most strongly. That decision is repeated across many layers, so the sentence gets read, revised, and refined step by step.

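The three components named above (tokens, attention scores, updated representations) can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular model is implemented: the two-dimensional vectors are made-up numbers, and each token's vector is reused directly as query, key, and value, whereas a trained transformer would first pass it through learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [dot(query, key) / scale for key in keys]  # relevance scores
    weights = softmax(scores)                           # focus per token
    dim = len(values[0])
    # Updated representation: weighted average of the value vectors.
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, mixed

# Hypothetical vectors for "the dog did not bark" (assumption: hand-picked).
tokens = ["the", "dog", "did", "not", "bark"]
vectors = [[0.1, 0.0], [1.0, 0.2], [0.1, 0.1], [0.2, 1.0], [0.9, 0.8]]

# Ask: while reading "bark", which tokens matter most?
weights, updated_bark = attend(vectors[4], vectors, vectors)
for token, weight in zip(tokens, weights):
    print(f"{token:>5}: {weight:.3f}")
```

With these hand-picked numbers, "bark" puts more focus on "dog" and "not" than on "the" or "did", so the subject and the negation both feed into its updated vector.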
Now apply that to a new situation. If the sentence is long and the important clue appears near the start, attention gives the model a direct route back to it. If several words matter, it can spread focus across all of them, which is exactly what older sequential models struggled to do. So the big shift is not just speed. It is control over relevance. Attention makes the model behave less like a blind line reader and more like a system that keeps checking what actually matters at each point in the sentence.
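To make the contrast with distance concrete, here is a toy comparison under loudly stated assumptions. The "fading memory" function is a caricature of a sequential model, with a made-up decay factor, and the attention scores are hypothetical. The point is only that the attention weight on a clue depends on how it scores against the other tokens, not on how far back it sits.

```python
import math

def attention_weights(scores):
    """Softmax: convert relevance scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fading_memory_share(distance, decay=0.8):
    """Toy caricature of a sequential model: the clue's influence shrinks
    by a fixed factor at every step (decay is an invented number)."""
    return decay ** distance

def clue_weight(distance, clue_score=4.0):
    """Attention weight on a clue at position 0 when `distance` filler
    tokens (score 0.0) follow it. Position never enters the calculation."""
    scores = [clue_score] + [0.0] * distance
    return attention_weights(scores)[0]

for distance in (2, 10, 40):
    print(distance,
          round(fading_memory_share(distance), 4),
          round(clue_weight(distance), 4))

# Several clues can share the focus: two equally relevant tokens
# (positions 0 and 3) receive equal weight.
shared = attention_weights([4.0, 0.0, 0.0, 4.0, 0.0])
```

In this sketch the attention weight is diluted a little as more filler tokens compete for it, but it does not collapse toward zero the way the decaying share does.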

From Text to Signals

The viewer will learn how transformers turn words into tokens and numbers, then use those representations to connect meaning across a sentence.

Now that we have the core trick, let's follow the text as it enters the model. The first thing that happens is not understanding in the human sense. The words are split into tokens, and those tokens are turned into numbers the network can process.

Those numbers start as embeddings, which place each token in a vector space where tokens used in similar ways sit near one another. That means the model is no longer looking at raw text. It is working with signals that carry learned information about how words tend to behave in context.

Then the tokens do not stay isolated. Each layer compares them with one another, updates their representations, and passes that richer state forward. The result is that a token's vector at the end of the stack contains more than its own identity; it carries what the surrounding tokens made it mean.

If you want to identify the components in this flow, name them in order: tokenization, embeddings, attention, and the stacked layers that refine the signal. Each step moves the same input further from plain text toward a context-aware representation that the next step can use.

One-sentence explanation: the transformer reads by converting words into numeric vectors, letting them interact, and then repeatedly updating those vectors until the sentence meaning is encoded in the final states. That is why the output can depend on the whole sentence, not just the last word it saw.

So if I give you a new sentence with a tricky reference, the model does not chase meaning one isolated word at a time. It builds meaning by carrying forward the relationships it has already computed, and that is the internal flow you want to keep in mind.
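The flow named above (tokenization, embeddings, attention, stacked layers) can be sketched end to end in plain Python. Everything here is a stand-in chosen for illustration: the whitespace tokenizer, the random embedding table, and the layer itself are assumptions, and the sketch leaves out position information and the learned projections that real transformers use.

```python
import math
import random

def tokenize(text):
    """Stand-in tokenizer (assumption): lowercase, split on whitespace."""
    return text.lower().split()

def make_embeddings(vocab, dim=4, seed=0):
    """Random vectors standing in for a learned embedding table."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1, 1) for _ in range(dim)] for tok in sorted(vocab)}

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_layer(states):
    """One simplified self-attention pass: every token scores every token,
    mixes their vectors by those weights, and adds the mix to its own state."""
    dim = len(states[0])
    scale = math.sqrt(dim)
    new_states = []
    for query in states:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in states]
        weights = softmax(scores)
        mixed = [sum(w * s[i] for w, s in zip(weights, states)) for i in range(dim)]
        new_states.append([q + m for q, m in zip(query, mixed)])
    return new_states

def encode(text, table, num_layers=2):
    """Tokens -> embeddings -> repeated attention layers -> final states."""
    tokens = tokenize(text)
    states = [table[tok] for tok in tokens]
    for _ in range(num_layers):
        states = attention_layer(states)
    return tokens, states

sentences = ["the bank of the river", "the bank approved the loan"]
table = make_embeddings({tok for s in sentences for tok in tokenize(s)})

# "bank" starts from the same embedding in both sentences...
tokens_a, states_a = encode(sentences[0], table)
tokens_b, states_b = encode(sentences[1], table)
bank_a = states_a[tokens_a.index("bank")]
bank_b = states_b[tokens_b.index("bank")]
# ...but ends with different final states, because its neighbours differ.
print(bank_a)
print(bank_b)
```

The design choice to share one embedding table across both sentences is what makes the comparison meaningful: any difference between the two final "bank" vectors comes from context alone.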
