january 11, 2026

the perceptron: no thoughts, just dot products

Andrew R. Louis

It’s just y = mx + b with a God complex.

1.1

the setup (mars edition)

we're on mars. we have two biological castes of martians:

🔵 chonkers (+1)

absolute units. dense.

built like refrigerators filled with other refrigerators.

🔴 skinny legends (-1)

gracile. hollow-boned.

built like a skittish sentient breadstick.

our job: build a machine that looks at a martian's measurements, then flashes the right light.

we have a clipboard. we start with two measurements:

  • H = height in cm
  • W = weight in kg

we write our measurements as a vector:

\mathbf{x} = \begin{bmatrix} H \\ W \end{bmatrix}

on the scatter plots below, i’ll put weight on the horizontal axis and height on the vertical, because it reads cleaner. in the algebra, \mathbf{x} just means “the measurements” — the order is arbitrary as long as \mathbf{w} uses the same order.

1.2

drawing the line

before we do any math, solve the problem visually.

we want a decision boundary—a line that separates the chonkers (+1) from the skinny legends (-1).

try it yourself: position a line so it separates the two groups.

now: what is that line, mathematically?

building the decision engine from scratch

how do we turn two numbers into one decision? we invent a score, then choose a cutoff: \text{score} > c means chonk.

look at the scatter. in this toy dataset, the chonkers mostly live to the right: heavier.

so a decent first bet is a score that rises when either measurement rises—move right, score goes up; move up, score goes up.

\text{score} = H + W

this score ignores units (cm vs kg). but it buys us two crucial things immediately:

  1. a single scalar we can threshold ("chonker if score > c")
  2. a clean geometry: the decision boundary \text{score} = c becomes a straight line (a diagonal cut between bottom-left and top-right)

now we fix the problems: cm and kg aren’t comparable, and sometimes one feature should count more than the other.

weights are exchange rates

if height should count double, you'd write:

\text{score} = 2H + W

that “2” is literally an exchange rate.

replace the hard-coded 1s with adjustable weights, w_1 and w_2:

\text{score} = w_1 H + w_2 W

interpretation:

  • one cm of height buys you w_1 score points
  • one kg of weight buys you w_2 score points

this lets us put heterogeneous measurements on the same scoreboard.

bias: pick the cutoff

we need one more number: a cutoff c. that’s the line in the sand.

  • if \text{score} > c → chonker
  • if \text{score} < c → skinny legend
  • if \text{score} = c → exactly on the boundary

what does that boundary look like? plot every point where \text{score} = c. you get a straight line—the one you were just positioning.

that straight line is just what "tied score" looks like in 2D. with three features it's a plane. with a hundred it's a hyperplane.

comparing to c is the same as moving c to the left-hand side and comparing to zero. define b = -c and write:

\text{score} + b > 0

same boundary, cleaner sign check.

sanity check with real numbers

take the equation you found and test it on two martians:

  • H=175, W=78 (chonker): score > cutoff → predicted chonker (+1).

  • H=160, W=52 (skinny legend): score < cutoff → predicted skinny legend (-1).

three numbers → a machine that can tell them apart.
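the same sanity check in code. the weights and cutoff here (w_1 = w_2 = 1, c = 220) are an assumption for illustration, not the line you found — any line that puts the two martians on opposite sides behaves the same way:

```python
# hypothetical weights and cutoff, chosen only so the two example martians
# land on opposite sides: score = w1*H + w2*W, then compare to the cutoff c.
def classify(H, W, w1=1.0, w2=1.0, c=220.0):
    score = w1 * H + w2 * W
    return +1 if score > c else -1   # chonker if score > cutoff

print(classify(175, 78))  # 253 > 220, prints 1  (chonker)
print(classify(160, 52))  # 212 < 220, prints -1 (skinny legend)
```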

the general form

the weights w_1, w_2 control the angle of the boundary. the bias b (or equivalently the cutoff c = -b) shifts it.

it works for any number of features. add age, antennas? just add more weighted terms. it's always a flat cut through the space (a "hyperplane").

we can fold the bias into the weights to keep the algebra clean: treat it as a weight w_0 attached to a constant input x_0 = 1. then:

w_0 \cdot 1 + w_1 H + w_2 W = 0

which is a product of two vectors:

\begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} 1 \\ H \\ W \end{bmatrix} = 0

so if we define:

\mathbf{x} = \begin{bmatrix} 1 \\ H \\ W \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}

the whole machine is:

\mathbf{w}^\top \mathbf{x} = 0

one vector of weights, one vector of inputs. multiply and sum.

1.3

the geometry of the dot product

we have this equation \mathbf{w}^\top \mathbf{x} = 0. what does it look like?

ignore the bias for a second. together, (w_1, w_2) form a vector — a direction in height-weight space. that's where the machine is looking.

the score measures how well a point \mathbf{x} aligns with that direction:

\text{score} = \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos \theta

where \theta is the angle between them.

  • if they point in roughly the same direction (\theta < 90^\circ), the score is positive.
  • if they point in opposite directions (\theta > 90^\circ), the score is negative.
  • if they are exactly perpendicular (\theta = 90^\circ), the score is zero.

and that last case is the boundary. \text{score} = 0 means "perpendicular to \mathbf{w}." every point perpendicular to the weight vector scores zero, and that set of points forms a line (in 2d), a plane (in 3d), a hyperplane (in general).
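a quick numerical sketch of that picture (the weight vector (2, 1) is an arbitrary choice for the demo): every multiple of the direction perpendicular to \mathbf{w} scores zero, and points on either side get opposite signs.

```python
import math

w = (2.0, 1.0)         # arbitrary weight vector (no bias term)
perp = (-w[1], w[0])   # w rotated 90 degrees: every multiple lies on the boundary

def score(x):
    return w[0] * x[0] + w[1] * x[1]

# perpendicular to w -> score is (numerically) zero, however far out you go
for t in (-3.0, 0.5, 2.0):
    assert math.isclose(score((t * perp[0], t * perp[1])), 0.0, abs_tol=1e-9)

print(score((1.0, 1.0)))    # roughly aligned with w -> positive (3.0)
print(score((-1.0, -1.0)))  # opposed to w -> negative (-3.0)
```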

learning is just rotating the flashlight so that the beam illuminates the chonkers and leaves the skinny legends in the dark.

the weights are the classifier. change the weights, you rotate the boundary.

we start with random weights and watch the first mistake happen.

1.4

the incident (an actual example with actual numbers)

our machine boots up. the knobs are set randomly:

\mathbf{w} = \begin{bmatrix} 5 \\ 0.1 \\ -0.3 \end{bmatrix}

meaning:

  • bias = 5 (starts chonk-leaning)
  • height coefficient = 0.1 (taller → slightly more chonk-leaning)
  • weight coefficient = -0.3 (heavier → MORE skinny-leaning??? backwards, but it's random)

a martian waddles in:

  • ground truth: this is a CHONKER (+1)
  • measurements: height = 170 cm, weight = 76 kg

\mathbf{x} = \begin{bmatrix} 1 \\ 170 \\ 76 \end{bmatrix}

the machine computes:

\text{score} = 5 \cdot 1 + 0.1 \cdot 170 + (-0.3) \cdot 76

step by step:

  • bias term: 5 \times 1 = 5
  • height term: 0.1 \times 170 = 17
  • weight term: -0.3 \times 76 = -22.8

\text{score} = 5 + 17 - 22.8 = \mathbf{-0.8}

score is negative. machine predicts skinny legend (-1).

but this martian is a CHONKER.

false negative.

geometrically: \mathbf{w}^\top\mathbf{x} < 0, so this point is on the “negative” side of the boundary — behind the beam.
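the incident, replayed in code (same weights and measurements as above):

```python
# w = [bias, height coefficient, weight coefficient], x = [1, H, W]
w = [5.0, 0.1, -0.3]
x = [1.0, 170.0, 76.0]

score = sum(wi * xi for wi, xi in zip(w, x))
prediction = +1 if score > 0 else -1

print(round(score, 6))  # -0.8  (5 + 17 - 22.8, up to float rounding)
print(prediction)       # -1: predicts skinny legend — a false negative
```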

1.5

learning from mistakes (the update)

our machine called a chonker a skinny legend. the beam is pointing the wrong way. we need to change the weights.

our incident chonker (y=+1) got a score of -0.8. multiply them together and you get -0.8. when the prediction and label disagree, y \cdot \text{score} is always negative.

take one training example (\mathbf{x}, y) where y \in \{-1, +1\}.

  • score = \mathbf{w}^\top \mathbf{x}
  • prediction = \text{sign}(\text{score})

when the prediction and the label agree, y \cdot \text{score} is positive (two positives or two negatives). when they disagree, it's negative:

y(\mathbf{w}^\top \mathbf{x}) < 0

what change to \mathbf{w} makes y(\mathbf{w}^\top\mathbf{x}) go up the fastest?

suppose we add some \Delta\mathbf{w}:

y((\mathbf{w} + \Delta\mathbf{w})^\top \mathbf{x}) = y(\mathbf{w}^\top \mathbf{x}) + y(\Delta\mathbf{w}^\top \mathbf{x})

the first term is the old mistake. the second term is the only knob we can turn.

the score is just a simple sum of multiplications. if you nudge one weight by some amount \Delta w_i, the overall score shifts by exactly that amount multiplied by the input x_i.

so the example itself tells you where the leverage is. for our incident \mathbf{x} = [1, 170, 76]:

  • add 1 to w_0 → score changes by 1
  • add 1 to w_1 → score changes by 170
  • add 1 to w_2 → score changes by 76

so we want \Delta\mathbf{w} to have a big positive dot product with y\mathbf{x}.

if you hold the step size fixed (say, \|\Delta\mathbf{w}\| = 1), the dot product \Delta\mathbf{w}^\top (y\mathbf{x}) = \|\Delta\mathbf{w}\|\,\|\mathbf{x}\|\cos\theta is maximized at \theta = 0 — when \Delta\mathbf{w} points in the same direction as y\mathbf{x}. just like the flashlight beam in 1.3, we want to physically rotate the weights to point exactly at the missed example.

point \Delta\mathbf{w} directly at y\mathbf{x}, scaled by \eta (the learning rate).

\mathbf{w} \leftarrow \mathbf{w} + \eta \cdot y \cdot \mathbf{x}
the perceptron update (error-driven)

when y=+1 you add \mathbf{x}; when y=-1 you subtract \mathbf{x}. same rule.

that’s the update: when you’re wrong, move toward the labeled point.

in pseudocode:

initialize w
repeat for epochs:           # one epoch = one pass through all points
  for (x, y) in data:
    if y * dot(w, x) <= 0:   # mistake (a score of exactly 0 counts as a miss)
      w = w + η * y * x
1.6

why this helps (the one-step guarantee)

without this trick, we'd need two cases: "if y=+1, check that the score is positive; if y=-1, check that it's negative." multiplying by y collapses both into one number — the signed margin — that should always be positive:

m = y \cdot (\mathbf{w}^\top \mathbf{x})

  • m > 0: correct side
  • m < 0: wrong side (update fires)

after an update \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}:

\begin{aligned} m_{new} &= y((\mathbf{w} + \eta y \mathbf{x})^\top \mathbf{x}) \\ &= y(\mathbf{w}^\top \mathbf{x}) + \eta \cdot y^2 \cdot \|\mathbf{x}\|^2 \\ &= m_{old} + \eta \|\mathbf{x}\|^2 \quad \text{(since } y^2 \text{ is always exactly 1)} \end{aligned}
signed-margin improvement on that example

since \eta > 0 and \|\mathbf{x}\|^2 > 0 for any real input, the signed margin strictly increases on that example.

one update makes you less wrong on that point. it doesn't promise the whole dataset stops.
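the one-step guarantee, checked numerically on the incident martian from 1.4 (with \eta = 1):

```python
eta = 1.0
w = [5.0, 0.1, -0.3]      # the random weights from 1.4
x = [1.0, 170.0, 76.0]    # the incident chonker
y = +1

def margin(w):
    return y * sum(wi * xi for wi, xi in zip(w, x))

m_old = margin(w)                                    # about -0.8: wrong side
w_new = [wi + eta * y * xi for wi, xi in zip(w, x)]  # one perceptron update
m_new = margin(w_new)

# the derivation predicts a jump of exactly eta * ||x||^2 = 1 + 170^2 + 76^2
print(round(m_new - m_old, 6))  # 34677.0
```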

can the dataset keep forcing updates forever?

1.7

the inevitable crash (why it stops)

if the data is separable, the perceptron cannot loop forever. our machine sorting martians must eventually stop misclassifying them.

not "probably won't." cannot.

we can prove this by tracking two physical quantities that are on a collision course. first, name the assumptions: "separable" means some unit vector \mathbf{u} and margin \gamma > 0 exist with y(\mathbf{u}^\top \mathbf{x}) \ge \gamma for every example; R is the largest \|\mathbf{x}\| in the dataset; and \mathbf{w} starts at \mathbf{0}.

the mandate (must grow fast)

the update adds \eta y\mathbf{x}, which always has a positive component along \mathbf{u} (at least \eta\gamma, by definition of the margin). so after k mistakes:

\mathbf{w}^\top \mathbf{u} \ge k \cdot (\eta \gamma)

the projection grows linearly with mistakes k.

the budget (can only grow slow)

every mistake adds at most a bounded amount of "energy" (squared length) to \mathbf{w}: at most (\eta R)^2 per update, because an update only fires when the score is on the wrong side, so the cross term 2\eta \cdot y(\mathbf{w}^\top \mathbf{x}) \le 0 can't help.

\|\mathbf{w}\|^2 \le k \cdot (\eta R)^2

the length grows at most like \sqrt{k}.


  • the mandate (dot product) demands linear growth (k).
  • the budget (length) only allows square-root growth (\sqrt{k}).

but \mathbf{w}^\top \mathbf{u} \le \|\mathbf{w}\|\,\|\mathbf{u}\| = \|\mathbf{w}\| (the formal name for this ceiling is the cauchy-schwarz inequality, but geometrically it just means what we saw in 1.3: \cos \theta maxes out at 1). a vector's projection onto a unit vector can never exceed the vector's own length.

\mathbf{w}^\top \mathbf{u} \le \|\mathbf{w}\|

so:

k \cdot (\eta \gamma) \le \sqrt{k} \cdot (\eta R)

a linear function of k will always eventually overtake a square root of k. but a projection can never outgrow the vector itself. so the only escape is that k stops growing: the algorithm must run out of mistakes.

let's pin down the number. the "collision" happens when the lower bound hits the upper bound:

k \eta \gamma \le \sqrt{k} \eta R

divide by \eta \sqrt{k}:

\sqrt{k}\, \gamma \le R

square both sides:

k \gamma^2 \le R^2 \implies k \le \frac{R^2}{\gamma^2}

so the number of mistakes k is bounded by (R/\gamma)^2. updating forever is impossible.

if the data isn't separable (no such \mathbf{u} exists), the mandate doesn't exist — there's no \gamma > 0 to force growth — and the perceptron can thrash forever. that's when we use the pocket algorithm (keep the best weights seen so far) or switch to a soft-margin classifier like an SVM.
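you can watch the bound hold on a toy dataset. the separating direction \mathbf{u} below is hand-picked for the demo (an assumption — the algorithm never sees it); \gamma and R come straight from their definitions:

```python
import math

# tiny separable dataset (no bias): the label is the sign of x1 + x2
data = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]

u = (1 / math.sqrt(2), 1 / math.sqrt(2))                       # hand-picked unit separator
gamma = min(y * (u[0] * x[0] + u[1] * x[1]) for x, y in data)  # worst-case margin
R = max(math.hypot(x[0], x[1]) for x, _ in data)               # largest ||x||
bound = (R / gamma) ** 2

w, k = [0.0, 0.0], 0          # start at zero, count mistakes
changed = True
while changed:
    changed = False
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1]) <= 0:   # mistake (score 0 counts)
            w = [w[0] + y * x[0], w[1] + y * x[1]]
            k += 1
            changed = True

print(k, "<=", bound)   # mistakes never exceed (R/gamma)^2
assert k <= bound
```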

1.8

making it work

the convergence proof says the perceptron must stop. it doesn't say it stops gracefully.

remember our incident martian from 1.4? \mathbf{x} = [1,\, 170,\, 76]. with \eta = 1, the margin jump on each update is \eta\|\mathbf{x}\|^2 = 1^2 + 170^2 + 76^2 = 34{,}677. one mistake and the weights are in orbit.

why so violent? because height is in centimeters and weight is in kilograms. the height coordinate is roughly 100\times louder than everything else, so it dominates the squared norm. the convergence bound is (R/\gamma)^2 — if R is inflated by one unscaled feature, the bound blows up quadratically.

the fix is boring and universal: standardize features (x_i \leftarrow (x_i - \mu_i)/\sigma_i) so no single dimension dominates. after standardization, our martian's coordinates are order-1 instead of order-100, the updates stay sane, and the convergence bound tightens.

the recipe

  • standardize features: x_i \leftarrow (x_i - \mu_i)/\sigma_i (save \mu_i, \sigma_i).
  • add bias: include x_0 = 1 so the boundary can shift.
  • shuffle each epoch: same points, fewer dumb cycles.
  • early stop: stop when mistakes hit 0 (separable) or validation stops improving (non-separable).
  • average or pocket: when in doubt, ship \bar{\mathbf{w}} or the best \mathbf{w} seen.
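the first recipe bullet, sketched in plain python (the height/weight rows are made-up sample data):

```python
# standardize: x_i <- (x_i - mu_i) / sigma_i, saving mu and sigma so the
# same transform can be applied to new martians at prediction time.
def fit_standardizer(rows):
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    sigma = [(sum((r[j] - mu[j]) ** 2 for r in rows) / n) ** 0.5 for j in range(d)]
    return mu, sigma

def standardize(row, mu, sigma):
    return [(v - m) / s for v, m, s in zip(row, mu, sigma)]

rows = [[170.0, 76.0], [160.0, 52.0], [175.0, 78.0], [165.0, 60.0]]
mu, sigma = fit_standardizer(rows)
scaled = [standardize(r, mu, sigma) for r in rows]
# each column of `scaled` now has mean ~0 and standard deviation ~1,
# so neither height nor weight dominates ||x||^2 (or the updates)
```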

what if the data can't be separated by any line?

1.9

escaping flatland (xor + why neural nets exist)

four points at the corners of a square. label them by diagonal: one diagonal pair is +1, the other is -1.

A | B | output
--|---|-------
0 | 0 | -1
0 | 1 | +1
1 | 0 | +1
1 | 1 | -1

try to draw a line that separates them.

you can't. sweep the boundary through every angle — every orientation catches one +1 and one -1 on each side. the perceptron will oscillate forever, shoving the boundary back and forth, never settling.

a line can only cut the plane into two half-planes, and no half-plane contains exactly one diagonal of a square. the +1s and -1s alternate around the corners.

the score w_1 A + w_2 B sees each input's contribution independently — it knows "A is big" and "B is big" as separate facts.

what it can't see is whether A and B agree. agreement between A and B isn't a direction — it's a relationship between directions, invisible to any weighted sum.

there are two ways out:

  1. change the representation (feature engineering)
  2. learn the representation (a neural network)
1.9.1

the feature-map escape hatch (make xor separable)

if the machine can't draw the boundary using A and B, we have to invent a new input that captures what it's missing.

the machine is looking at A and B as separate facts. what it can't see is whether they agree. we need to turn "agreement" into a raw number we can hand to it.

look at the inputs. they are 0s and 1s. how do you mathematically combine two 0s and 1s into a single number that screams "both are on"?

  • you can't just add them (A+B): the sums are 0, 1, 1, 2 — both +1 cases land on 1, sandwiched between the -1 cases at 0 and 2, so no single threshold isolates them.
  • you can't subtract them either: A-B gives -1 and +1 for the +1 cases, with the -1 cases at 0 sitting in between. same trap.

but if you multiply them (A \cdot B), you get a 1 if and only if both A and B are 1. every other combination yields 0.

by manually computing A \cdot B and feeding it in alongside the original inputs, we give the machine a dedicated knob for that relationship. we literally pop the 2D plane into a 3D space, lifting the (1,1) corner up so a flat sheet of glass can slice cleanly underneath it.

this preprocessing step is called a feature map:

\phi(A,B) = (1, A, B, A \cdot B)

the mapped points are:

point | \phi(A,B) | label
------|-----------|------
(0,0) | (1, 0, 0, 0) | -1
(0,1) | (1, 0, 1, 0) | +1
(1,0) | (1, 1, 0, 0) | +1
(1,1) | (1, 1, 1, 1) | -1

one separating weight vector is \mathbf{w} = [-1, 2, 2, -4]. check:

  • (0,0): -1 + 0 + 0 + 0 = -1 → -1 ✓
  • (0,1): -1 + 0 + 2 + 0 = 1 → +1 ✓
  • (1,0): -1 + 2 + 0 + 0 = 1 → +1 ✓
  • (1,1): -1 + 2 + 2 - 4 = -1 → -1 ✓

translation: “if BOTH inputs are 1, subtract 4 from the score.” that feature makes XOR separable.

but this is the catch: you had to know to add A \cdot B. the perceptron can't see features it wasn't given — A \cdot B was in the beam's blind spot. it can only learn weights for the dimensions you hand it.

that A \cdot B term is the simplest interaction feature: it turns “both are on” into a single coordinate.

in the martian story, H \cdot W is the same move. it gives you a knob for “tall and heavy” that a purely linear score can’t express directly.
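here's the escape hatch end to end: lift the four XOR points through \phi, then let the plain perceptron loop find separating weights in the lifted space. it won't necessarily land on [-1, 2, 2, -4] — any separator counts:

```python
def phi(A, B):
    return [1.0, A, B, A * B]   # feature map: bias, raw inputs, interaction

xor = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]
lifted = [(phi(A, B), y) for (A, B), y in xor]

w = [0.0, 0.0, 0.0, 0.0]
changed = True
while changed:                  # guaranteed to stop: the lifted data is separable
    changed = False
    for x, y in lifted:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
            w = [wi + y * xi for wi, xi in zip(w, x)]
            changed = True

for x, y in lifted:
    assert y * sum(wi * xi for wi, xi in zip(w, x)) > 0     # strictly separated
print(w)
```

note the learned interaction weight comes out negative, just like the hand-built -4: "both on" has to subtract from the score.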

1.9.2

the question for next time

the feature map worked — but only because we knew A \cdot B was the right feature.

for XOR that's trivial — two inputs, one pairwise interaction to try. for our martians, we could guess H \cdot W. but for a real problem with a hundred measurements, the pairwise interactions alone number \binom{100}{2} = 4{,}950. three-way? 161{,}700. you're not hand-crafting your way out of that.

the perceptron can learn any boundary in the space you give it. but it can't see past that space. if the feature that matters is a combination you didn't think to include, no amount of weight-nudging will find it. the flashlight can only illuminate dimensions that exist.

what if the machine could build its own dimensions?

that's the idea behind a neural network. instead of you designing A \cdot B and hoping, you give the machine a stack of layers. each layer computes weighted sums — exactly like a perceptron — then bends the result through a nonlinearity (like clamping negatives to zero). the early layers combine raw inputs into new features. the later layers combine those into higher-order ones.

the network doesn't know it needs A \cdot B. it builds something that works like A \cdot B because the training signal rewards it. the feature map isn't designed. it's grown.

the final layer is still a dot product. everything we built in this essay — the score, the boundary, the sign check — is still there at the output. a neural network is a perceptron whose inputs are learned features rather than raw measurements.

but stacking layers creates a problem the perceptron never had.

in a single perceptron, when the machine miscalls a chonker, we know exactly who to blame: the one weight vector \mathbf{w}. we nudge it toward y\mathbf{x} and move on. one knob, one gradient, one step.

in a network with three layers and thousands of weights, the output depends on all of them. when the prediction is wrong, which layer screwed up? the first layer, which built a bad feature? the second, which combined good features badly? the last, which drew the boundary in the wrong place? every weight had a hand in the output, and each one deserves a different share of the blame.

credit assignment — which weight to blame, by how much. that's the next essay.

1.10

tl;dr (for the adhd crew)

  1. the machine: weighted sum of inputs → is it positive?
  2. the problem: we don't know the right weights
  3. the fix: when wrong, \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}
  4. why it helps: margin increases by \eta \|\mathbf{x}\|^2 on the mistake example
  5. the catch: convergence needs separability (otherwise it can thrash)
  6. the deeper catch: if your representation is wrong (xor), no amount of linear sweat will save you
  7. the open question: neural nets learn their own features — but when the output is wrong, which weights do you blame? (next essay: backpropagation)
1.11

exercises (pencil + paper)

the goal: you should be able to do these with a pencil, and you should know you're right at the end.

exercise 1: one update, no excuses
  • start with \mathbf{w} = [2, -1, 0.5] (bias, height, weight)
  • one martian: \mathbf{x} = [1, 4, 3], true label y = +1
  • learning rate: \eta = 1

do this by hand:

  1. compute \text{score} = \mathbf{w}^\top \mathbf{x} and the predicted label
  2. if it's wrong, apply the update \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}
  3. recompute the score with the new weights

score = 2 + (-1) \cdot 4 + 0.5 \cdot 3 = -0.5, so the prediction is -1 (wrong). update:

\mathbf{w}_{new} = [2, -1, 0.5] + (+1)[1, 4, 3] = [3, 3, 3.5]

new score = 3 + 3 \cdot 4 + 3.5 \cdot 3 = 25.5 (positive), so it flips to +1. you can also verify the margin jump:

m_{new} = m_{old} + \eta \|\mathbf{x}\|^2 = (-0.5) + (1^2 + 4^2 + 3^2) = 25.5
exercise 2: bias matters

consider two points with a single feature x:

  • x_A = 2, y_A = +1
  • x_B = 5, y_B = -1

now do two mini-cases:

  1. no bias term: you are forced to use \text{score} = w_1 x (the boundary must pass through 0)
  2. with bias: you can use \text{score} = w_0 + w_1 x

with no bias, the boundary is w_1 x = 0, so x = 0. both points have positive x, so they always land on the same side and can't be separated.

with bias, pick any threshold between 2 and 5, e.g. boundary x = 3.5. one valid choice:

w_0 = 3.5,\; w_1 = -1

then \text{score}_A = 3.5 - 2 = 1.5 > 0 (A is +1) and \text{score}_B = 3.5 - 5 = -1.5 < 0 (B is -1).

exercise 3: when it doesn't converge (pocket vs averaged)

this dataset is not separable on purpose: two examples share identical features but disagree on the label.

use a 2d input with bias: \mathbf{x} = [1, x_1, x_2].

training set (in this exact order), with \eta = 1 and starting weights \mathbf{w} = [0, 0, 0]:

  • A: \mathbf{x}_A = [1, 1, 1], y_A = +1
  • B: \mathbf{x}_B = [1, -1, 2], y_B = +1
  • C: \mathbf{x}_C = [1, 1, 1], y_C = -1 (label noise: same features as A, opposite label)

your job:

  1. run one epoch and write down \mathbf{w} after A, after B, and after C
  2. compute the pocket weight: among the weights you visited, which one makes the fewest mistakes on A/B/C?
  3. compute the averaged weight: \bar{\mathbf{w}} = the average of the three post-step weights, and count its mistakes on A/B/C

run the epoch:

  • start \mathbf{w} = [0, 0, 0]
  • A is misclassified (score = 0 → -1), so update: \mathbf{w} \leftarrow \mathbf{w} + (+1)[1, 1, 1] = [1, 1, 1]
  • B is correct under [1, 1, 1] (score = 1 + (-1) + 2 = 2), so no update (still [1, 1, 1])
  • C is misclassified under [1, 1, 1] (score = 3 → +1 but y = -1), so update: \mathbf{w} \leftarrow [1, 1, 1] + (-1)[1, 1, 1] = [0, 0, 0]

so the visited weights are:

  • after A: [1, 1, 1]
  • after B: [1, 1, 1]
  • after C: [0, 0, 0]

pocket: evaluate mistakes on A/B/C.

  • with [1, 1, 1]: A ✓, B ✓, C ✗ → 1 mistake
  • with [0, 0, 0]: A ✗, B ✗, C ✓ → 2 mistakes

so pocket keeps [1, 1, 1].

averaged: \bar{\mathbf{w}} = \frac{[1,1,1] + [1,1,1] + [0,0,0]}{3} = [2/3, 2/3, 2/3]. this predicts A ✓, B ✓, C ✗ → 1 mistake.

the moral: when the data is non-separable, the last iterate can be garbage. pocket/average gives you something stable to ship.

bonus: escape the number line

here's a 1D dataset:

x | label
--|------
0 | -1
1 | +1
2 | +1
3 | -1

the +1s are sandwiched between the -1s.

your job:

  1. explain why no single threshold w_0 + w_1 x = 0 can separate these points
  2. propose a feature map \phi(x) that makes them separable (hint: think quadratic)
  3. find a weight vector \mathbf{w} that works, and verify it on all four points

a single threshold splits the number line into two half-lines: everything left of the cut, and everything right. but the +1s live in the middle. you need a feature that "closes" around the +1s — something that's big in the middle and small at the edges (or vice versa).

1. a linear boundary w_0 + w_1 x = 0 is a single cut point on the number line. one side is +1, the other is -1. but the +1 region here is an interval [1, 2], not a half-line. you'd need two cuts — one threshold can't do it.

2. use \phi(x) = (1, x, x^2). the quadratic term lets the boundary curve.

3. try \mathbf{w} = [-1, 3, -1]. compute \mathbf{w}^\top \phi(x) = -1 + 3x - x^2:

  • x = 0: -1 + 0 - 0 = -1 → -1 ✓
  • x = 1: -1 + 3 - 1 = 1 → +1 ✓
  • x = 2: -1 + 6 - 4 = 1 → +1 ✓
  • x = 3: -1 + 9 - 9 = -1 → -1 ✓

the quadratic -x^2 + 3x - 1 is a downward parabola — positive at x = 1 and x = 2, negative at the endpoints. the perceptron can't invent x^2 — you had to add it — but once you do, the problem is linearly separable in the lifted space.
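the same check in code (the lift and weights are the ones from the solution):

```python
def phi(x):
    return [1.0, x, x * x]   # lift the number line: bias, x, x^2

w = [-1.0, 3.0, -1.0]        # w^T phi(x) = -1 + 3x - x^2

def predict(x):
    score = sum(wi * fi for wi, fi in zip(w, phi(x)))
    return +1 if score > 0 else -1

print([predict(x) for x in (0, 1, 2, 3)])   # [-1, 1, 1, -1]
```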

bonus 2: implement the loop

using the 5-bullet recipe from §1.8 as your spec, write the full training loop. here's a small dataset to test it on:

  • A = (1, 1, 1), y = +1
  • B = (1, 2, 1), y = +1
  • C = (1, 0.5, 2), y = -1

(the bias input is already included.) run 3 full passes through the data with \eta = 1, processing in order A, B, C each pass. report:

  1. final weights
  2. pocket weights (best seen)
  3. averaged weights

the recipe bullets are the spec. pocket = keep the weight vector with fewest mistakes on the full dataset. average = running sum of every weight vector seen, divided by total steps.

w = [0, 0, 0]
w_sum = [0, 0, 0]; t = 0
w_pocket = w; best_mistakes = inf

for epoch in 1..N:          # one pass through all points
  # (shuffle in production; fixed order here for hand-tracing)
  mistakes = 0
  for (x, y) in data:
    score = dot(w, x)
    pred = +1 if score > 0 else -1
    if pred != y:
      w = w + eta * y * x
      mistakes += 1
    w_sum += w; t += 1

  total_mistakes = count_mistakes(w, all_data)
  if total_mistakes < best_mistakes:
    w_pocket = w; best_mistakes = total_mistakes
  if mistakes == 0: break

w_avg = w_sum / t
return w, w_pocket, w_avg

hand-trace, pass 1 (\mathbf{w} starts at [0, 0, 0]):

step | point | \mathbf{w}^\top \mathbf{x} | pred | label | mistake? | \mathbf{w} after
-----|-------|-----------------------------|------|-------|----------|------------------
1 | A (1, 1, 1), y = +1 | 0 | -1 | +1 | yes | [1, 1, 1]
2 | B (1, 2, 1), y = +1 | 1 + 2 + 1 = 4 | +1 | +1 | no | [1, 1, 1]
3 | C (1, 0.5, 2), y = -1 | 1 + 0.5 + 2 = 3.5 | +1 | -1 | yes | [0, 0.5, -1]

after pass 1: \mathbf{w} = [0, 0.5, -1]. check all points:

  • A: 0 + 0.5 - 1 = -0.5 → pred -1, label +1 — wrong
  • B: 0 + 1 - 1 = 0 → pred -1, label +1 — wrong
  • C: 0 + 0.25 - 2 = -1.75 → pred -1, label -1 — correct

2 mistakes on the full dataset. pocket stores [0, 0.5, -1] with 2 mistakes (better than \infty, the starting count).

continue passes 2-3 the same way. your implementation should reproduce this trace for pass 1.

pass 2 (\mathbf{w} starts at [0, 0.5, -1]):

step | point | \mathbf{w}^\top \mathbf{x} | pred | label | mistake? | \mathbf{w} after
-----|-------|-----------------------------|------|-------|----------|------------------
4 | A | 0 + 0.5 - 1 = -0.5 | -1 | +1 | yes | [1, 1.5, 0]
5 | B | 1 + 3 + 0 = 4 | +1 | +1 | no | [1, 1.5, 0]
6 | C | 1 + 0.75 + 0 = 1.75 | +1 | -1 | yes | [0, 1, -2]

end of pass 2: \mathbf{w} = [0, 1, -2]. check all: A scores -1 (wrong), B scores 0 → -1 (wrong), C scores -3.5 (correct). 2 mistakes — no pocket improvement.

pass 3 (\mathbf{w} starts at [0, 1, -2]):

step | point | \mathbf{w}^\top \mathbf{x} | pred | label | mistake? | \mathbf{w} after
-----|-------|-----------------------------|------|-------|----------|------------------
7 | A | 0 + 1 - 2 = -1 | -1 | +1 | yes | [1, 2, -1]
8 | B | 1 + 4 - 1 = 4 | +1 | +1 | no | [1, 2, -1]
9 | C | 1 + 1 - 2 = 0 | -1 | -1 | no | [1, 2, -1]

end of pass 3: \mathbf{w} = [1, 2, -1]. check all: A scores 2 ✓, B scores 4 ✓, C scores 0 → -1 ✓. 0 mistakes. pocket updates to [1, 2, -1].

(note: point C lands exactly on the boundary with score 0. our code’s tie-breaker else -1 saves us here, but mathematically, you'd usually want a strictly non-zero score to claim true separation!)

final answers:

  1. final weights: [1, 2, -1]
  2. pocket weights: [1, 2, -1] (0 mistakes — converged, so pocket = final)
  3. averaged weights: \bar{\mathbf{w}} = \frac{1}{9}\sum_{t=1}^{9} \mathbf{w}_t = [7/9,\; 12.5/9,\; -4/9] \approx [0.78,\; 1.39,\; -0.44] — gets A and B right but misclassifies C (1 mistake)
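a runnable version of the loop, if you want to check your own against something executable (fractions keep the 0.5s exact; fixed A, B, C order, no shuffling, \eta = 1). it reproduces the hand-trace and all three answers:

```python
from fractions import Fraction as F

data = [([1, 1, 1], +1), ([1, 2, 1], +1), ([1, F(1, 2), 2], -1)]  # A, B, C

def count_mistakes(w):
    # pred = +1 if score > 0 else -1 (same tie-breaker as the trace)
    return sum((1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y
               for x, y in data)

w = [F(0), F(0), F(0)]
w_sum, t = [F(0), F(0), F(0)], 0
w_pocket, best = list(w), float("inf")

for epoch in range(3):                     # 3 fixed passes, order A, B, C
    mistakes = 0
    for x, y in data:
        if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]   # eta = 1
            mistakes += 1
        w_sum = [s + wi for s, wi in zip(w_sum, w)]     # running sum for averaging
        t += 1
    total = count_mistakes(w)
    if total < best:                       # pocket: best weights seen so far
        w_pocket, best = list(w), total
    if mistakes == 0:
        break

w_avg = [s / t for s in w_sum]
print([int(v) for v in w])          # [1, 2, -1]
print([int(v) for v in w_pocket])   # [1, 2, -1]
print([str(v) for v in w_avg])      # ['7/9', '25/18', '-4/9']  (25/18 = 12.5/9)
```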