There is a moment in Chapter 2 of Gilbert Strang's textbook where you are asked to solve a 3×3 system by hand.
Not to read about solving it. Not to verify that a library can do it. To actually sit there with a pencil and find the multiplier, subtract the row, watch the entry become zero.
I have been building software for years. I know how to ship things. I know how to call functions. I did not know what it felt like to do mathematics by hand until this year—and I did not expect it to change anything. It did.
I don't want to just use the tools. I want to understand what's underneath and push toward the frontier.
That is how I opened my first post. This is the second. It is about what happens when you actually go underneath—slowly, on paper, with real numbers—and what you start seeing in transformers that you could not see before.
The Problem That Made It Click
The system was straightforward. Three equations, three unknowns. The instruction: reduce to upper triangular form by two row operations. Circle the pivots. Solve by back substitution.
// the system
2x + 3y + z = 8
4x + 7y + 5z = 20
-2y + 2z = 0
I noticed the third equation had no x. That meant I only needed one step before moving to the second column. Small observation. But it mattered—because I was the one looking, not the computer.
Pivot = 2. Entry to kill = 4. Multiplier = 4 ÷ 2 = 2. Subtract 2 times row 1 from row 2. Every entry. Including the right side. The 4 in position (2,1) became zero.
Something about watching that happen by hand was different from watching a function return a result. Back substitution gave z = 1, y = 1, x = 2. I checked it. 2(2) + 3(1) + 1(1) = 8. Correct. And I sat there for a moment thinking: this exact sequence—pivot, multiplier, subtract—is what runs inside every optimizer step of every large language model being trained right now.
Not a metaphor. The same algorithm. Running on hardware I cannot fully imagine, billions of times a second, but doing what I just did with a pencil.
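The same two elimination steps, written out as a NumPy sketch (not from the book, just mirroring the hand computation on the augmented matrix):

```python
import numpy as np

# The system as an augmented matrix [A | b]
M = np.array([[ 2.0,  3.0, 1.0,  8.0],
              [ 4.0,  7.0, 5.0, 20.0],
              [ 0.0, -2.0, 2.0,  0.0]])

# Step 1: multiplier = 4/2 = 2; subtract 2 * row 1 from row 2
M[1] -= (M[1, 0] / M[0, 0]) * M[0]

# Step 2: multiplier = -2/1 = -2; subtract -2 * row 2 from row 3
M[2] -= (M[2, 1] / M[1, 1]) * M[1]

# Back substitution on the upper triangular system
z = M[2, 3] / M[2, 2]
y = (M[1, 3] - M[1, 2] * z) / M[1, 1]
x = (M[0, 3] - M[0, 1] * y - M[0, 2] * z) / M[0, 0]
print(x, y, z)  # 2.0 1.0 1.0
```

The two in-place row updates are exactly the pencil steps: pivot, multiplier, subtract.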
The Factorization Was Hiding There the Whole Time
Section 2.3 reframed everything. Every elimination step is secretly a matrix multiplication. That step where I subtracted 2 times row 1 from row 2? That was multiplying A by an elimination matrix—the identity with a single entry changed: −2 at position (2,1). The subtraction is encoded in the matrix itself.
I built one of these for every elimination step. Chained them together. And then Section 2.6 showed me what that chain actually was:
A = LU
L is lower triangular. U is upper triangular. The entries of L are the multipliers I computed during elimination. They drop in unchanged. No new computation. The factorization was hiding inside the work the whole time.
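To see the multipliers drop into L unchanged, here is a small check with the numbers from the system above (my sketch, not Strang's code):

```python
import numpy as np

A = np.array([[2.0,  3.0, 1.0],
              [4.0,  7.0, 5.0],
              [0.0, -2.0, 2.0]])

# The multipliers from elimination go straight into L
L = np.array([[1.0,  0.0, 0.0],
              [2.0,  1.0, 0.0],    # multiplier for position (2,1) was 2
              [0.0, -2.0, 1.0]])   # multiplier for position (3,2) was -2

# U is the upper triangular matrix that elimination produced
U = np.array([[2.0, 3.0, 1.0],
              [0.0, 1.0, 3.0],
              [0.0, 0.0, 8.0]])

print(np.allclose(L @ U, A))  # True
```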
And the reason this matters beyond elegance: factor A once—cost n³/3. Solve any new right-hand side—cost n² from then on. This is not a clever trick. This is why training a transformer on 175 billion parameters is computationally feasible instead of completely impossible.
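The factor-once, solve-many pattern can be sketched with SciPy's LU routines (`lu_factor` / `lu_solve`; assuming SciPy is available):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))

# Factor once: the O(n^3) work happens here
lu, piv = lu_factor(A)

# Every new right-hand side after that costs only O(n^2)
for _ in range(3):
    b = rng.standard_normal(500)
    x = lu_solve((lu, piv), b)
    assert np.allclose(A @ x, b)
```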
Every call to torch.linalg.solve in PyTorch is LU under the hood. Not because someone chose it. Because it is the right algorithm. The one that makes the numbers tractable.
Hallucination Is a Geometry Problem
Before any of the elimination work, Section 2.1 gave me the picture I kept returning to. Ax = b is asking a question: can b be built by combining the columns of A?
The column space of A is every vector that can be reached by some combination of those columns. If b is in the column space, there is a solution. If b is outside—there is not. The columns are directions. If you have three independent directions in 3D space, you can reach anything. If your directions are dependent, your column space is flat. A plane. A line. Some places are simply unreachable.
A language model that hallucinates is not lying. It is doing the geometrically correct thing.
The true answer exists somewhere in the space of all possible responses. But that point is outside the model's column space—the set of responses its weight matrices can actually produce. So it returns the closest point it can reach. Which is wrong. But it is the best it can do given its geometry.
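A toy version of that geometry, sketched in NumPy (my example, not any model's actual weights): the columns span a plane, the target has a component out of the plane, and least squares returns the closest reachable point.

```python
import numpy as np

# Two columns spanning only the xy-plane of 3D space
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])

# A target outside the column space (it has a z-component)
b = np.array([1.0, 2.0, 3.0])

# Least squares finds the closest reachable point, not the truth
coeffs, *_ = np.linalg.lstsq(A, b, rcond=None)
closest = A @ coeffs
print(closest)                       # [1. 2. 0.] -- z is unreachable
print(np.linalg.norm(b - closest))   # 3.0 -- the irreducible error
```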
Fine-tuning reshapes the column space. LoRA adds low-rank directions, ΔW = AB, so previously unreachable points become reachable. Sixteen thousand new parameters instead of a million. Same column space expansion. Different cost. Every technique in modern AI alignment is, at some level, column space surgery. I would not have had that language before Chapter 2.
The Dot Product Is Everywhere, and I Mean Everywhere
Chapter 1 gave me the dot product formula. Multiply component by component. Add everything up. One number comes out. I computed dozens by hand. Lengths. Angles. Perpendicularity checks. The Schwarz inequality: |v · w| ≤ ‖v‖ · ‖w‖.
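A few of those hand computations, redone as a NumPy sketch (my example vectors):

```python
import numpy as np

v = np.array([1.0, 2.0, 2.0])
w = np.array([3.0, 0.0, 4.0])

dot = v @ w              # component products summed: 3 + 0 + 8 = 11
len_v = np.sqrt(v @ v)   # 3.0
len_w = np.sqrt(w @ w)   # 5.0

print(dot, len_v, len_w)          # 11.0 3.0 5.0
assert abs(dot) <= len_v * len_w  # Schwarz: |v.w| <= ||v|| ||w||
```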
Then I built the attention mechanism:
# attention scores
import numpy as np

def attention_scores(Q, K):
    d_k = Q.shape[-1]
    return Q @ K.T / np.sqrt(d_k)
Q @ K.T is every query vector dotted with every key vector. The attention score between two tokens is a dot product. The model is literally asking: how aligned are these two vectors? How much does what you are looking for match what I contain?
And the 1/√d scaling—that is the Schwarz inequality applied as an engineering constraint. Without it, dot products grow with the embedding dimension: for random unit-variance vectors, a d-dimensional dot product has typical size around √d. Softmax saturates. Gradients vanish. Training breaks. The scaling factor is not arbitrary. It is the inequality doing its job.
I recognized that connection only because I had spent time with the inequality by hand. It was not obvious from the formula. It became obvious from the geometry.
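A quick experiment shows the saturation directly (my sketch; the dimensions are arbitrary):

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 4096
q = rng.standard_normal(d)
keys = rng.standard_normal((8, d))

raw = keys @ q              # unscaled scores: typical size ~ sqrt(d)
scaled = raw / np.sqrt(d)   # scaled scores: typical size ~ 1

# The unscaled softmax is far more peaked: attention collapses
# onto one key, and the gradients elsewhere go to zero
print(softmax(raw).max(), softmax(scaled).max())
```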
AB ≠ BA Is Not a Footnote
Section 2.3 mentions almost in passing: matrix multiplication is not commutative. AB is usually not BA. I verified it with numbers, got different answers, noted it, moved on.
Then I thought about what it means for a 32-layer transformer. Layer 1 operates on raw token embeddings. Layer 2 operates on what layer 1 produced. Layer 7 expects the output of layer 6—a representation that has been processed six times. It was trained on that specific input. It expects that specific shape.
If you swap layer 3 and layer 7, you have changed the product. The matrices do not commute. The model collapses. Not because of a software bug. Not because someone made a design error. Because AB ≠ BA. Reordering a product of matrices changes what it computes. Every neural network in existence has a fixed layer order for this reason.
This felt significant when it clicked—that a property I had verified on a 2×2 example was load-bearing for the entire architecture of deep learning.
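The 2×2 verification takes four lines (my example matrices, not the book's):

```python
import numpy as np

A = np.array([[1, 2],
              [0, 1]])
B = np.array([[1, 0],
              [3, 1]])

print(A @ B)  # [[7 2], [3 1]]
print(B @ A)  # [[1 2], [3 7]]
print(np.array_equal(A @ B, B @ A))  # False
```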
Backpropagation Is Two Rules From Chapter 2
Section 2.5 gives the inverse of a product: (AB)⁻¹ = B⁻¹A⁻¹. Reverse order. Socks before shoes—to undo, remove shoes first. Section 2.7 gives the transpose of a product: (AB)ᵀ = BᵀAᵀ. Same pattern. Reverse order.
During the backward pass of a neural network, gradients flow in reverse through the layers. And at each layer, the backward pass applies the transpose of the weight matrix—Wᵀ times the gradient from above.
Backpropagation is not a black-box training algorithm. It is those two rules— applied to gradients, flowing backward through a chain of matrix operations. I had been intimidated by backprop for a while. The idea of automatic differentiation through a complex graph felt like its own domain of knowledge. It is. But the linear algebra underneath it is Chapter 2. That is where the foundation is.
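A minimal numerical check of the transpose rule, for a single linear layer with a squared-error style loss (my sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)

# Forward: y = W x.  Loss: L = 0.5 * ||y||^2, so dL/dy = y.
y = W @ x
grad_y = y

# The transpose rule: dL/dx = W^T (dL/dy)
grad_x = W.T @ grad_y

# Check one component against a finite difference
eps = 1e-6
x2 = x.copy(); x2[0] += eps
numeric = (0.5 * np.sum((W @ x2)**2) - 0.5 * np.sum(y**2)) / eps
print(abs(numeric - grad_x[0]) < 1e-4)  # True
```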
Building the Attention Head
At the end of this arc, I built a working attention head. Not in PyTorch. Not using HuggingFace. Using the functions I had written from scratch.
# single attention head — from scratch
def single_head(X, Wq, Wk, Wv):
    Q = project(X, Wq)   # matrix multiplication
    K = project(X, Wk)
    V = project(X, Wv)
    scores = attention_scores(Q, K)  # dot products
    weights = softmax(scores)
    return weights @ V   # linear combination of V
Every function traces back to a concept from these two chapters. project() is matrix-vector multiplication—Section 1.3. attention_scores() is dot products. The final weights @ V is a linear combination—Section 1.1. The whole thing is column space and linear combinations, end to end.
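The helpers themselves live in the repo; here is a minimal sketch of what project() and softmax() could look like (my versions, possibly not identical to the repo's):

```python
import numpy as np

def project(X, W):
    # Each row of X is a token embedding; W maps it to a new space
    return X @ W

def softmax(scores):
    # Row-wise softmax; subtracting the row max keeps exp() stable
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)
```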
I ran it on four tokens: ‘the’, ‘cat’, ‘sat’, ‘on’. The attention map printed as a 4×4 matrix. Each row summed to 1. Each entry said how much one token was looking at another. I had built the thing that sits at the center of every large language model. From a dot product formula in Chapter 1.
The full code is at github.com/khaledyusuf44/linear_algebra_engine. Five files. No black boxes.
What Doing It by Hand Actually Does
I could have read about all of this. There are good explanations online. Videos with nice animations. Blog posts with diagrams.
But reading about matrix multiplication and computing it by hand are different cognitive events.
When I spent an hour on Problem 12—finding the multiplier, subtracting the row, watching the entry become zero—something about that friction built a different kind of understanding. The number is not abstract when you are the one who computed it. The moment the 4 becomes 0 is not just information. It is something you did.
I am not saying everyone needs to do this. There are faster paths to using transformers effectively. But there is a kind of understanding that comes from friction—from the calculation resisting you, from catching your own arithmetic errors, from building intuition through physical engagement with the numbers—that shortcuts cannot replicate.
When I read a PyTorch file now and see F.scaled_dot_product_attention(Q, K, V), I do not see a black box. I see the dot products. The scaling. The softmax turning scores into weights. The linear combination blending the V vectors. The column space of V limiting what can be expressed.
I see this because I built it. Slowly. By hand.
Chapter 3 is next—vector spaces, the null space, the four fundamental subspaces. The geometry gets deeper from here.