← writing
january 11, 2026

the perceptron: no thoughts, just dot products

Andrew R. Louis

It’s just y = mx + b with a God complex.

1.1

the setup (mars edition)

we're on mars. we have two biological castes of martians:

🔵 chonkers (+1)

absolute units. dense.

built like refrigerators filled with other refrigerators.

🔴 skinny legends (-1)

gracile. hollow-boned.

built like a skittish sentient breadstick.

our job: build a machine that looks at a martian's measurements, then flashes the right light.

we have a clipboard. we start with two measurements:

  • HH = height in cm
  • WW = weight in kg

we write our measurements as a vector, which, for now, just means the measurements stacked in a bracket:

x=[HW]{\xx \mathbf{x}} = \begin{bmatrix} H \\ W \end{bmatrix}

on the scatter plots below, i’ll put weight on the horizontal axis and height on the vertical, because it reads cleaner. in the algebra, x{\xx \mathbf{x}} just means “the measurements”; the order is arbitrary as long as w{\ww \mathbf{w}} uses the same order.

1.2

drawing the line

before we do any math, solve the problem visually.

we want a decision boundary: a line that separates the chonkers (+1+1) from the skinny legends (1-1).

try it yourself. drag the two handles to position the line so it separates the two groups.

now: what is that line, mathematically?

building the decision engine from scratch

how do we turn two numbers into one decision? we invent a score, then choose a cutoff: score>c\text{score} > c means chonk.

look back at the scatter you just dragged a line through. in this toy dataset, the chonkers mostly live to the right: heavier.

so a decent first bet is a score that rises when either measurement rises: move right, score goes up; move up, score goes up.

score=H+W\text{score} = H + W

this score adds centimeters to kilograms like they're the same thing. we'll fix that. but it already collapses the clipboard to a single number we can threshold ("chonker if score > c"), and the threshold has a shape: the points where score=c\text{score} = c form a straight diagonal line. that's the thing you just drew, rediscovered as algebra.

two problems remain: cm and kg aren’t comparable, and sometimes one feature should count more than the other.

weights are exchange rates

if height should count double, you'd write:

score=2H+W\text{score} = 2H + W

that “2” is literally an exchange rate.

replace the hard-coded 1s with adjustable weights, w1w_1 and w2w_2:

score=w1H+w2W\text{score} = w_1 H + w_2 W

interpretation:

  • one cm of height buys you w1w_1 score points
  • one kg of weight buys you w2w_2 score points

this lets us put heterogeneous measurements on the same scoreboard.

bias: pick the cutoff

we need one more number: a cutoff cc. that’s the line in the sand.

  • if score>c\text{score} > c → chonker
  • if score<c\text{score} < c → skinny legend
  • if score=c\text{score} = c → exactly on the boundary

what does that boundary look like? plot every point where score=c\text{score} = c. you get a straight line: that's what you were dragging around.

that straight line is just what "tied score" looks like in 2D. with three features it's a plane. with a hundred it's a hyperplane.

comparing to cc is the same as moving cc to the left and comparing to zero. define b=cb = -c and write:

score+b>0\text{score} + b > 0

same boundary, cleaner sign check.

sanity check with real numbers

here's the equation you found:

test it on two martians:

  • H=175, W=78 (a chonker):

  • H=160, W=52 (a skinny legend):

the chonker should score above the cutoff and the skinny legend below it. if either verdict reads wrong, your line probably needs work: scroll up and drag until both come out right. (one caveat: the printed check rounds your coefficients, so a line that only barely squeaks past a point can round the wrong way. give the boundary some breathing room.) the machine has no judgment beyond the three numbers you picked.

three numbers → a machine that can tell them apart.

the general form

the weights w1,w2w_1, w_2 control the angle of the boundary. the bias bb (or equivalently the cutoff c=bc = -b) shifts it.

it works for any number of features. add age, antennas? just add more weighted terms. it's always a flat cut through the space (a "hyperplane").

we can fold the bias into the weights to keep the algebra clean: treat it as a weight w0w_0 attached to a constant input x0=1x_0 = 1. then:

w01+w1H+w2W=0w_0 \cdot 1 + w_1 H + w_2 W = 0

which is a product of two vectors:

[w0w1w2][1HW]=0\begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} 1 \\ H \\ W \end{bmatrix} = 0

so if we define:

x=[1HW],w=[w0w1w2]{\xx \mathbf{x}} = \begin{bmatrix} 1 \\ H \\ W \end{bmatrix}, \quad {\ww \mathbf{w}} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}

the whole machine is:

wx=0{\ww \mathbf{w}}^\top {\xx \mathbf{x}} = 0

one vector of weights, one vector of inputs. multiply and sum. (that's all the raised \top means: tip the column of weights onto its side so it can pair up with the inputs. wherever you see wx{\ww \mathbf{w}}^\top{\xx \mathbf{x}}, read "multiply matching entries, add them up.")

1.3

the geometry of the dot product

we have this equation wx=0{\ww \mathbf{w}}^\top {\xx \mathbf{x}} = 0. what does it look like?

ignore the bias for a second. so far a vector has been a column of numbers. plot those numbers as coordinates (first number along one axis, second along the other) and the same vector becomes an arrow from the origin to that point. one object, two costumes. the arrow's length gets written x\|{\xx \mathbf{x}}\|, and it's just pythagoras on the column: x=x12+x22\|{\xx \mathbf{x}}\| = \sqrt{x_1^2 + x_2^2}. that is all the double bars ever mean, here or anywhere below.

together, (w1,w2)(w_1, w_2) form a vector: an arrow pointing somewhere in height-weight space. that's where the machine is looking.

the score measures how well a point x{\xx \mathbf{x}} aligns with that direction:

score=wx=wxcosθ\text{score} = {\ww \mathbf{w}} \cdot {\xx \mathbf{x}} = \|{\ww \mathbf{w}}\| \|{\xx \mathbf{x}}\| \cos \theta

where θ\theta is the angle between the two arrows. if trig is a distant memory: cosθ\cos \theta is a dial for agreement. it reads 1 when the arrows point the same way, slides to 0 as they approach 90°, and hits −1 when they're opposed.

  • if they point in roughly the same direction (θ<90\theta < 90^\circ), the score is positive.
  • if they point in opposite directions (θ>90\theta > 90^\circ), the score is negative.
  • if they are exactly perpendicular (θ=90\theta = 90^\circ), the score is zero.

two notes. the dot in wx{\ww \mathbf{w}} \cdot {\xx \mathbf{x}} is the same multiply-and-add as wx{\ww \mathbf{w}}^\top{\xx \mathbf{x}}: two notations, one operation. and the equality between multiply-and-add and lengths-times-agreement-dial is the one fact in this essay we'll assert without deriving.

the viz computes the score both ways, live. drag w{\ww \mathbf{w}} (where the machine looks) and x{\xx \mathbf{x}} (a martian) anywhere: the two numbers refuse to disagree, the score flips sign the instant the angle crosses 90°, and it grows when either arrow gets longer.

and that last case is the boundary. score=0\text{score} = 0 means "perpendicular to w{\ww \mathbf{w}}." every point perpendicular to the weight vector scores zero, and that set of points forms a line (in 2d), a plane (in 3d), a hyperplane (in general).

learning is just rotating the flashlight so that the beam illuminates the chonkers and leaves the skinny legends in the dark.

one loose end: a real objection, though if 3D slicing makes your head hurt, skip this paragraph freely; nothing below depends on it. the line you dragged in 1.2 didn't pass through the origin, but wx=0{\ww \mathbf{w}}^\top{\xx \mathbf{x}} = 0 is a wall through the origin (x=0{\xx \mathbf{x}} = \mathbf{0} always scores zero, no matter what w{\ww \mathbf{w}} is). both are true, and the bias trick from 1.2 is what reconciles them. in the augmented space [1,H,W][1, H, W], the boundary really is a flat wall through the origin, perpendicular to the full three-component w{\ww \mathbf{w}}. but every actual martian lives on the slice x0=1x_0 = 1, one unit out along the bias axis. slice a tilted wall-through-the-origin with that sheet and you get an offset line. the bias w0w_0 controls the wall's tilt, and the tilt controls where the line lands in your 2D plot. so "ignore the bias" gave you a flat shadow of the honest picture: one dimension up, the flashlight geometry is exact.

the weights are the classifier. change the weights, you rotate the boundary.

we start with random weights and watch the first mistake happen.

1.4

the incident (an actual example with actual numbers)

our machine boots up. the knobs are set randomly:

w=[50.10.3]{\ww \mathbf{w}} = \begin{bmatrix} 5 \\ 0.1 \\ -0.3 \end{bmatrix}

meaning:

  • bias = 5 (starts chonk-leaning)
  • height coefficient = 0.1 (taller → slightly more chonk-leaning)
  • weight coefficient = -0.3 (heavier → MORE skinny-leaning??? backwards, but it's random)

a martian waddles in:

  • ground truth: this is a CHONKER (+1)
  • measurements: height = 170 cm, weight = 76 kg
x=[117076]{\xx \mathbf{x}} = \begin{bmatrix} 1 \\ 170 \\ 76 \end{bmatrix}

the machine computes:

score=51+0.1170+(0.3)76\text{score} = 5 \cdot 1 + 0.1 \cdot 170 + (-0.3) \cdot 76

step by step:

  • bias term: 5×1=55 \times 1 = 5
  • height term: 0.1×170=170.1 \times 170 = 17
  • weight term: 0.3×76=22.8-0.3 \times 76 = -22.8
score=5+1722.8=0.8\text{score} = 5 + 17 - 22.8 = \mathbf{-0.8}

score is negative. machine predicts skinny legend (-1).

but this martian is a CHONKER.

false negative.

geometrically: wx<0{\ww \mathbf{w}}^\top{\xx \mathbf{x}} < 0, so this point is on the “negative” side of the boundary: behind the beam.

and it's not just our random-knob machine that makes mistakes. below is the line you found in 1.2, against the same crowd plus two new arrivals: a water-retaining chonker who measures skinny, and a dehydrated skinny legend who scans chonk. check the verdict under the plot. odds are your line misses someone too. mistakes happen even to good lines. the question is what to do after one.

1.5

learning from mistakes (the update)

our machine called a chonker a skinny legend. the beam is pointing the wrong way. we need to change the weights.

our incident chonker (y=+1y=+1) got a score of -0.8. multiply them together and you get -0.8. when the prediction and label disagree, yscorey \cdot \text{score} is always negative.

take one training example (x,y)({\xx \mathbf{x}}, y), where the label yy is either +1+1 (chonker) or 1-1 (skinny legend).

  • score = wx{\ww \mathbf{w}}^\top {\xx \mathbf{x}}
  • prediction = sign(score)\text{sign}(\text{score}), the score's sign: +1+1 if positive, 1-1 if negative

when the prediction and the label agree, yscorey \cdot \text{score} is positive (two positives or two negatives). when they disagree, it's negative:

y(wx)<0y({\ww \mathbf{w}}^\top {\xx \mathbf{x}}) < 0

what change to w{\ww \mathbf{w}} makes y(wx)y({\ww \mathbf{w}}^\top{\xx \mathbf{x}}) go up the fastest?

suppose we add some Δw\Delta\mathbf{w}:

y((w+Δw)x)=y(wx)+y(Δwx)y(({\ww \mathbf{w}} + \Delta{\ww \mathbf{w}})^\top {\xx \mathbf{x}}) = y({\ww \mathbf{w}}^\top {\xx \mathbf{x}}) + y(\Delta{\ww \mathbf{w}}^\top {\xx \mathbf{x}})

the first term is the old mistake. the second term is the only knob we can turn.

the score is just a simple sum of multiplications. if you nudge one weight by some amount Δwi\Delta w_i, the overall score shifts by exactly that amount multiplied by the input xix_i.

so the example itself tells you where the leverage is. for our incident x=[1,170,76]{\xx \mathbf{x}}=[1,170,76]:

  • add 1 to w0w_0 → score changes by 1
  • add 1 to w1w_1 → score changes by 170
  • add 1 to w2w_2 → score changes by 76

so we want Δw\Delta\mathbf{w} to have a big positive dot product with yxy{\xx \mathbf{x}}.

if you hold the step size fixed (say, Δw=1\|\Delta\mathbf{w}\| = 1), the dot product Δw(yx)=Δwxcosθ\Delta\mathbf{w}^\top (y{\xx \mathbf{x}}) = \|\Delta\mathbf{w}\|\,\|{\xx \mathbf{x}}\|\cos\theta is maximized at θ=0\theta = 0, when Δw\Delta\mathbf{w} points in the same direction as yxy{\xx \mathbf{x}}. just like the flashlight beam in 1.3, we want to physically rotate the weights to point exactly at the missed example.

point Δw\Delta\mathbf{w} directly at yxy{\xx \mathbf{x}}, scaled by a positive number η\eta that you choose (the learning rate: how big a step to take).

ww+ηyx{\ww \mathbf{w}} \leftarrow {\ww \mathbf{w}} + \eta \cdot y \cdot {\xx \mathbf{x}}
the perceptron update (error-driven)

when y=+1y=+1 you add x{\xx \mathbf{x}}; when y=1y=-1 you subtract x{\xx \mathbf{x}}. same rule.

that’s the update: when you’re wrong, move toward the labeled point.

in pseudocode:

initialize w (zeros are fine)
repeat for epochs:                      # one epoch = one pass through all points
  for (x, y) in data:
    pred = +1 if dot(w, x) > 0 else -1  # tie (score exactly 0) breaks to -1
    if pred != y:                       # mistake
      w = w + η * y * x

the tie-break is load-bearing. start from w=0{\ww \mathbf{w}} = \mathbf{0} and every score is exactly zero. if a tied score counted as correct, the loop would never fire a single update; the machine would sit at zero forever, technically never wrong by its own generous accounting. calling score 0 a 1-1 means the first chonker who walks in registers as a mistake and gets the weights moving.

here's one update, before and after: the boundary rotates toward the missed point.

1.6

why this helps (the one-step guarantee)

the disagreement detector from 1.5 deserves a name. the signed margin is the score times the label:

m=y(wx)m = y \cdot ({\ww \mathbf{w}}^\top {\xx \mathbf{x}})
  • m>0m > 0: correct side
  • m<0m < 0: wrong side (update fires)
  • m=0m = 0: on the fence; the tie-break from 1.5 calls it 1-1, so it's a mistake exactly when the martian is a chonker

either way, an update only ever fires when m0m \le 0. file that fact away; it carries the whole proof in 1.7.

one small gear before the algebra: a vector dotted with itself is its own squared length, xx=x2{\xx \mathbf{x}}^\top{\xx \mathbf{x}} = \|{\xx \mathbf{x}}\|^2. multiply matching entries with themselves and you've built pythagoras's sum of squares.

after an update ww+ηyx{\ww \mathbf{w}} \leftarrow {\ww \mathbf{w}} + \eta y {\xx \mathbf{x}}:

mnew=y((w+ηyx)x)=y(wx)+ηy2x2=mold+ηx2(since y2 is always exactly 1)\begin{aligned} m_{new} &= y(({\ww \mathbf{w}} + \eta y {\xx \mathbf{x}})^\top {\xx \mathbf{x}}) \\ &= y({\ww \mathbf{w}}^\top {\xx \mathbf{x}}) + \eta \cdot y^2 \cdot \|{\xx \mathbf{x}}\|^2 \\ &= m_{old} + \eta \|{\xx \mathbf{x}}\|^2 \quad \text{(since } y^2 \text{ is always exactly 1)} \end{aligned}
signed-margin improvement on that example

since η>0\eta > 0 and x2>0\|{\xx \mathbf{x}}\|^2 > 0 for any real input, the signed margin strictly increases on that example.

step through it yourself below. each press visits one point: correct ones pass quietly, wrong ones fire the update and the boundary lurches. nothing protects the other points while it lurches.

one update makes you less wrong on that point. it doesn't promise the whole dataset stops.

can the dataset keep forcing updates forever?

1.7

the inevitable crash (why it stops)

if the data is separable, the perceptron runs out of mistakes. the proof tracks two quantities on a collision course.

(in 1.4 we booted up with random knobs because it makes a better story. the proof starts from zero because it makes cleaner algebra; a nonzero start survives the same argument, it just tacks a constant onto each bound and shifts the final count.)

the mandate (must grow fast)

every update drags w{\ww \mathbf{w}} toward the golden direction. the projection onto u\mathbf{u} (the shadow w{\ww \mathbf{w}} casts along that direction) grows linearly with mistakes kk:

wuk(ηγ){\ww \mathbf{w}}^\top \mathbf{u} \ge k \cdot (\eta {\gm \gamma})
the budget (can only grow slow)

but the length of w{\ww \mathbf{w}} grows at most like k\sqrt{k}:

w2k(ηR)2\|{\ww \mathbf{w}}\|^2 \le k \cdot (\eta {\rr R})^2

(square root both sides: wkηR\|{\ww \mathbf{w}}\| \le \sqrt{k}\,\eta {\rr R}.)

the mandate is almost the definition of margin. each update adds ηyx\eta y{\xx \mathbf{x}}, and y(ux)γy(\mathbf{u}^\top{\xx \mathbf{x}}) \ge {\gm \gamma} for every training point; that's what "classifies everything with margin γ{\gm \gamma}" means. so each mistake pushes wu{\ww \mathbf{w}}^\top\mathbf{u} up by at least ηγ\eta{\gm \gamma}, and starting from zero, kk mistakes force wukηγ{\ww \mathbf{w}}^\top\mathbf{u} \ge k\eta{\gm \gamma}.

the budget is the claim that should bother you. vector lengths don't behave this politely in general: add kk aligned vectors and the length grows linearly, not like k\sqrt{k}. expand what one update does to the squared length:

w+ηyx2=w2+2ηy(wx)+η2x2\|{\ww \mathbf{w}} + \eta y {\xx \mathbf{x}}\|^2 = \|{\ww \mathbf{w}}\|^2 + 2\eta \cdot y({\ww \mathbf{w}}^\top {\xx \mathbf{x}}) + \eta^2 \|{\xx \mathbf{x}}\|^2
one update's effect on the budget

(that expansion is school's (a+b)2=a2+2ab+b2(a+b)^2 = a^2 + 2ab + b^2, with dot products standing in for multiplication.)

stare at the middle term. it's the signed margin from 1.6, scaled by 2η2\eta. and updates only fire on mistakes, when y(wx)0y({\ww \mathbf{w}}^\top{\xx \mathbf{x}}) \le 0. the middle term is never positive at the moment we add to w{\ww \mathbf{w}}. the trigger that causes the growth is the same condition that caps it: the perceptron only ever lengthens its weights when the weights were pointing the wrong way, and adding a vector you disagree with mostly cancels.

so each mistake adds at most η2x2(ηR)2\eta^2\|{\xx \mathbf{x}}\|^2 \le (\eta {\rr R})^2 of squared length. after kk mistakes, starting from zero: w2k(ηR)2\|{\ww \mathbf{w}}\|^2 \le k(\eta {\rr R})^2.

watch it happen with numbers. say w=[1,2]{\ww \mathbf{w}} = [1, 2] meets x=[1,2]{\xx \mathbf{x}} = [1, -2] with label y=+1y = +1 and η=1\eta = 1: the score is 14=31 - 4 = -3, a mistake, so we add x{\xx \mathbf{x}}. the expansion predicts the new squared length: 5+2(3)+5=45 + 2(-3) + 5 = 4. check it directly: w+x=[2,0]{\ww \mathbf{w}} + {\xx \mathbf{x}} = [2, 0], squared length 44. we added an arrow of squared length 5 and the weights got shorter. that's the middle term doing its job.

the same fact, draggable. the shaded half is where mistakes live: every point there scores negative against w{\ww \mathbf{w}}, so it is the only region the update rule ever adds from. drag x{\xx \mathbf{x}} around both halves and keep your eye on the middle term. in the shaded half it is never positive, and w2\|{\ww \mathbf{w}'}\|^2 stays under the naive sum. in the unshaded half the middle term turns positive and the lengths would stack, but the perceptron never adds from there.

the race is plotted below. γ{\gm \gamma} and R{\rr R} set the slopes (the sliders wear the same colors); drag kk forward and watch the linear mandate close in on the square-root ceiling.

  • the mandate (dot product) demands linear growth (kk).
  • the budget (length) only allows square-root growth (k\sqrt{k}).

but wuwu=w{\ww \mathbf{w}}^\top \mathbf{u} \le \|{\ww \mathbf{w}}\|\,\|\mathbf{u}\| = \|{\ww \mathbf{w}}\| (the formal name for this ceiling is the cauchy-schwarz inequality, but geometrically it just means what we saw in 1.3: cosθ\cos \theta maxes out at 1). a vector's projection onto a unit vector (its shadow along that direction) can never be longer than the vector itself.

wuw{\ww \mathbf{w}}^\top \mathbf{u} \le \|{\ww \mathbf{w}}\|

so:

k(ηγ)k(ηR)k \cdot (\eta {\gm \gamma}) \le \sqrt{k} \cdot (\eta {\rr R})

a linear function of kk will always eventually overtake a square root of kk. but a projection can never outgrow the vector itself. so the only escape is that kk stops growing: the algorithm must run out of mistakes.

the collision happens when the floor meets the ceiling:

kηγkηRk \eta {\gm \gamma} \le \sqrt{k} \eta {\rr R}

divide by ηk\eta \sqrt{k}:

kγR\sqrt{k} {\gm \gamma} \le {\rr R}

square both sides:

kγ2R2    kR2γ2k {\gm \gamma}^2 \le {\rr R}^2 \implies k \le \frac{{\rr R}^2}{{\gm \gamma}^2}

the whole proof fits in one breath:

kηγ  mandate  wu  shadow  w  budget  kηRk\,\eta{\gm \gamma} \;\overset{\text{mandate}}{\le}\; {\ww \mathbf{w}}^\top\mathbf{u} \;\overset{\text{shadow}}{\le}\; \|{\ww \mathbf{w}}\| \;\overset{\text{budget}}{\le}\; \sqrt{k}\,\eta {\rr R}
the entire theorem, one chain. worth memorizing as a single object. same colors as the race above: cyan γ, magenta R, amber for the machine's weights

read it left to right: the left end rises like kk, the right end rises like k\sqrt{k}, and the chain has to hold after every single mistake. a straight line cannot stay under a square root forever. so something must give, and the only thing that can give is kk. squeeze the ends together and out falls k(R/γ)2k \le ({\rr R}/{\gm \gamma})^2.

look at what survived the algebra. the learning rate didn't: it divided out and never came back. with zero starting weights, every weight vector the perceptron visits is a scaled copy of the η=1\eta = 1 run, and scaling never flips the sign of a score, so η\eta changes nothing about which mistakes happen. the knob everyone tunes first is, for this algorithm, decorative. the dimension count never even appeared: the bound sees the data only through R{\rr R} and γ{\gm \gamma}. two measurements per martian or two million: same geometry, same budget.

the bound is a number you can compute from the data before training starts. the loop can keep running forever, but the updates run out: one more mistake would push the mandate through the cauchy-schwarz ceiling, and that inequality doesn't bend. the machine stops misclassifying martians. not "probably won't keep failing." cannot.

now watch it happen. the trainer below starts from zero weights. before you press run, make a prediction: does the boundary glide into place, or thrash before it settles?

run the learn mission to zero mistakes. every lurch of the boundary is one update, and the theorem you just read is the reason the lurching ends. then run fail. that dataset has one flipped label, so no golden u\mathbf{u} exists, there's no γ>0{\gm \gamma} > 0, the mandate evaporates, and the thrashing never has to stop. that's when we reach for the pocket algorithm (keep the best weights seen so far) or switch to a soft-margin classifier like an SVM.

exercise 4 has you run the race yourself: four martians, a pencil, both bounds checked at every mistake.

1.8

making it work

the convergence proof says the perceptron must stop. it doesn't say it stops gracefully.

remember our incident martian from 1.4? x=[1,170,76]{\xx \mathbf{x}} = [1,\, 170,\, 76]. with η=1\eta = 1, the margin jump on each update is ηx2=12+1702+762=34,677\eta\|{\xx \mathbf{x}}\|^2 = 1^2 + 170^2 + 76^2 = 34{,}677. one mistake and the weights are in orbit.

why so violent? because the inputs are raw units. against a bias input of 1, the measurements are a hundred times louder, and the squared length stacks the squares: height alone contributes 1702=28,900170^2 = 28{,}900 of the 34,677. the convergence bound is (R/γ)2({\rr R}/{\gm \gamma})^2, and it's the ratio that matters: multiply every coordinate by 10 and γ{\gm \gamma} inflates right alongside R{\rr R}, changing nothing. the damage comes from features that are loud relative to each other: one booming coordinate inflates the radius while the boundary still has to be negotiated by the quiet ones, so R/γ{\rr R}/{\gm \gamma} balloons.

the fix is boring and universal: standardize features. subtract each feature's average μi\mu_i and divide by its typical spread σi\sigma_i (the standard deviation), xi(xiμi)/σix_i \leftarrow (x_i - \mu_i)/\sigma_i, so no single dimension dominates. after standardization, our martian's coordinates are order-1 instead of order-100, the updates stay sane, and the convergence bound tightens.

you can feel this one too: the trainer in 1.7 has a scale mission. run it raw, open the advanced panel, and watch Δw\|\Delta\mathbf{w}\| in the telemetry: same η\eta, violent steps. flip to standardized and the steps shrink to something civilized.

the recipe

  • standardize features: xi(xiμi)/σix_i \leftarrow (x_i - \mu_i)/\sigma_i (save μi,σi\mu_i,\sigma_i).
  • add bias: include x0=1x_0 = 1 so the boundary can shift.
  • shuffle each epoch: same points, fewer dumb cycles.
  • early stop: stop when mistakes hit 0 (separable) or validation stops improving (non-separable).
  • average or pocket: when in doubt, ship wˉ{\ww \bar{\mathbf{w}}} or the best w{\ww \mathbf{w}} seen.

what if the data can't be separated by any line?

1.9

escaping flatland (xor + why neural nets exist)

four points at the corners of a square. label them by diagonal: one diagonal pair is +1, the other is -1.

ABoutput
00-1
01+1
10+1
11-1

try to draw a line that separates them. really, drag it around.

you can't. sweep the boundary through every angle: at least one side always ends up mixed, and the counter never reaches zero. (it's even being generous: it scores whichever side-labeling flatters your line.) the perceptron will oscillate forever, shoving the boundary back and forth, never settling.

two lines prove it. both diagonals share the same midpoint: the center of the square. and a half-plane has no dents (the formal word is convex): if one side of a line holds a segment's two endpoints, it holds everything between them. so the side holding both +1s would own the center, and the side holding both -1s would own the center too. one point, claimed by both sides of a line. no.

the score w1A+w2Bw_1 A + w_2 B sees each input's contribution independently: it knows "A is big" and "B is big" as separate facts.

what it can't see is whether A and B agree. agreement lives between the inputs, and a weighted sum has no term for between: every input contributes on its own, blind to the others.

there are two ways out:

  1. change the representation (feature engineering)
  2. learn the representation (a neural network)
1.9.1

the feature-map escape hatch (make xor separable)

if the flashlight can't see the pattern, give it a new dimension to look along: one that turns "agreement," the exact thing the weighted sum is blind to, into a raw number we can hand over as an input.

look at the inputs. they are 0s and 1s. we need to compute something new from them. and "new" is a real constraint, because most of the obvious candidates aren't:

  • A+BA+B is a weighted sum. the machine builds weighted sums: hand it A+BA+B as a feature and you've given it a knob it could already turn by setting w1=w2w_1 = w_2. no new direction. same flatland, extra steps.
  • ABA-B, 3A+2B3A + 2B, any recipe of the form αA+βB\alpha A + \beta B: same problem. they're all inside the flashlight's reach already.

the new input has to be something no weighted sum can fake. multiplication is the cheapest such thing: ABA \cdot B is 1 exactly when both inputs are on, and no αA+βB+γ\alpha A + \beta B + \gamma reproduces that table: matching the three 0s forces α=β=γ=0\alpha = \beta = \gamma = 0, which then can't produce the 1.

by manually computing ABA \cdot B and feeding it in alongside the original inputs, we give the machine a dedicated knob for that relationship. geometrically, the new coordinate lifts the (1,1) corner out of the plane, and now a flat sheet of glass can separate the classes. the sheet has to tilt, angled so the lifted (1,1) and the ground-floor (0,0) both end up above it while the two +1 corners sit below. drag the slider and watch the corner climb until the glass can finally get between the classes.

this preprocessing step is called a feature map:

ϕ(A,B)=(1,A,B,AB)\phi(A,B) = (1, A, B, A\cdot B)

(four coordinates once you count the bias; the picture above shows the three that move.)

the mapped points are:

pointϕ(A,B)\phi(A,B)label
(0,0)(1,0,0,0)(1, 0, 0, 0)-1
(0,1)(1,0,1,0)(1, 0, 1, 0)+1
(1,0)(1,1,0,0)(1, 1, 0, 0)+1
(1,1)(1,1,1,1)(1, 1, 1, 1)-1

one separating weight vector is w=[1,2,2,4]{\ww \mathbf{w}} = [-1, 2, 2, -4]. check:

  • (0,0)(0,0): 1+0+0+0=1-1 + 0 + 0 + 0 = -1 → -1 ✓
  • (0,1)(0,1): 1+0+2+0=1-1 + 0 + 2 + 0 = 1 → +1 ✓
  • (1,0)(1,0): 1+2+0+0=1-1 + 2 + 0 + 0 = 1 → +1 ✓
  • (1,1)(1,1): 1+2+24=1-1 + 2 + 2 - 4 = -1 → -1 ✓

translation: “if BOTH inputs are 1, subtract 4 from the score.” that feature makes XOR separable.

but this is the catch: you had to know to add ABA\cdot B. the perceptron can't see features it wasn't given: ABA \cdot B was in the beam's blind spot. it can only learn weights for the dimensions you hand it.

that ABA\cdot B term is the simplest interaction feature: it turns “both are on” into a single coordinate.

in the martian story, HWH\cdot W is the same move. continuous measurements cost it the clean on/off behavior of ABA\cdot B, but the knob still depends on both measurements jointly, which no weighted sum of HH and WW alone can express.

1.9.2

the question for next time

the feature map worked, but only because we knew ABA\cdot B was the right feature.

for XOR that's trivial: two inputs, one pairwise interaction to try. for our martians, we could guess HWH\cdot W. but for a real problem with a hundred measurements, there are 4,9504{,}950 possible pairwise interactions. three-way combinations? 161,700161{,}700. you're not hand-crafting your way out of that.

the perceptron can learn any boundary in the space you give it. but it can't see past that space. if the feature that matters is a combination you didn't think to include, no amount of weight-nudging will find it. the flashlight can only illuminate dimensions that exist.

what if the machine could build its own dimensions?

that's the idea behind a neural network. instead of you designing ABA\cdot B and hoping, you give the machine a stack of layers. each layer computes weighted sums (exactly like a perceptron), then bends the result through a nonlinearity (like clamping negatives to zero). the bend is load-bearing: stack linear layers without it and the tower collapses, because a weighted sum of weighted sums is just one big weighted sum, and we spent this whole section proving what a single one can't do. with the bend, the early layers combine raw inputs into new features, and the later layers combine those into higher-order ones.

nobody tells the network about ABA\cdot B. training rewards whatever accidentally behaves like ABA\cdot B, so that behavior grows.

the final layer is still a dot product. everything we built in this essay (the score, the boundary, the sign check) is still there at the output. a neural network is a perceptron whose inputs are learned features rather than raw measurements.

but stacking layers creates a problem the perceptron never had.

in a single perceptron, when the machine miscalls a chonker, we know exactly who to blame: the one weight vector w{\ww \mathbf{w}}. we nudge it toward yxy{\xx \mathbf{x}} and move on. one knob, one gradient, one step.

in a network with three layers and thousands of weights, the output depends on all of them. when the prediction is wrong, the blame is smeared: maybe an early layer built a bad feature, maybe the last layer drew a bad boundary through good features, most likely some of both. every weight had a hand in the output, and each one deserves a different share of the blame.

credit assignment: which weight to blame, by how much. that's the next essay.

1.10

tl;dr (for the adhd crew)

  1. the machine: weighted sum of inputs → is it positive?
  2. the problem: we don't know the right weights
  3. the fix: when wrong, ww+ηyx{\ww \mathbf{w}} \leftarrow {\ww \mathbf{w}} + \eta y {\xx \mathbf{x}}
  4. why it helps: margin increases by ηx2\eta \|{\xx \mathbf{x}}\|^2 on the mistake example
  5. the catch: if a perfect line with breathing room γ{\gm \gamma} exists, the total number of mistakes is capped at (R/γ)2({\rr R}/{\gm \gamma})^2, where R{\rr R} is the length of the longest input vector. if no such line exists, the updates can thrash forever
  6. the deeper catch: if your representation is wrong (xor), no amount of linear sweat will save you
  7. the open question: neural nets learn their own features, but when the output is wrong, which weights do you blame? (next essay: backpropagation)
1.11

exercises (pencil + paper)

the goal: you should be able to do these with a pencil, and you should know you're right at the end.

1One Update, No Excuses
Question
  • start with w=[2,1,0.5]{\ww \mathbf{w}} = [2, -1, 0.5] (bias, height, weight)
  • one martian: x=[1,4,3]{\xx \mathbf{x}} = [1, 4, 3], true label y=+1y = +1; measurements already standardized per the §1.8 recipe (nobody here is four centimeters tall)
  • learning rate: η=1\eta = 1

do this by hand:

  1. compute score=wx\text{score} = {\ww \mathbf{w}}^\top {\xx \mathbf{x}} and the predicted label
  2. if it's wrong, apply the update ww+ηyx{\ww \mathbf{w}} \leftarrow {\ww \mathbf{w}} + \eta y {\xx \mathbf{x}}
  3. recompute the score with the new weights

score = 2+(1)4+0.53=0.52 + (-1)\cdot 4 + 0.5 \cdot 3 = -0.5, so prediction is -1 (wrong). update:

wnew=[2,1,0.5]+(+1)[1,4,3]=[3,3,3.5]{\ww \mathbf{w}_{new}} = [2, -1, 0.5] + (+1)[1,4,3] = [3, 3, 3.5]

new score = 3+34+3.53=25.53 + 3\cdot 4 + 3.5\cdot 3 = 25.5 (positive), so it flips to +1. you can also verify the margin jump:

mnew=mold+ηx2=(0.5)+(12+42+32)=25.5m_{new} = m_{old} + \eta \|{\xx \mathbf{x}}\|^2 = (-0.5) + (1^2 + 4^2 + 3^2) = 25.5
2Bias Matters
Question

consider two points with a single feature xx:

  • xA=2x_A = 2, yA=+1y_A = +1
  • xB=5x_B = 5, yB=1y_B = -1

now do two mini-cases:

  1. no bias term: you are forced to use score=w1x\text{score} = w_1 x (boundary must pass through 0)
  2. with bias: you can use score=w0+w1x\text{score} = w_0 + w_1 x

with no bias, the boundary is w1x=0w_1 x = 0, so x=0x = 0. both points have positive x, so they always land on the same side and can't be separated.

with bias, pick any threshold between 2 and 5, e.g. boundary x=3.5x = 3.5. one valid choice:

w0=3.5,  w1=1w_0 = 3.5,\; w_1 = -1

then scoreA=3.52=1.5>0\text{score}_A = 3.5 - 2 = 1.5 > 0 (A is +1) and scoreB=3.55=1.5<0\text{score}_B = 3.5 - 5 = -1.5 < 0 (B is -1).

3When It Doesn't Converge (Pocket vs Averaged)
Question

this dataset is not separable on purpose: two examples share identical features but disagree on the label.

use a 2d input with bias: x=[1,x1,x2]{\xx \mathbf{x}} = [1, x_1, x_2].

training set (in this exact order), with η=1\eta = 1 and starting weights w=[0,0,0]{\ww \mathbf{w}} = [0,0,0]:

  • A: xA=[1,1,1]{\xx \mathbf{x}_A} = [1, 1, 1], yA=+1y_A = +1
  • B: xB=[1,1,2]{\xx \mathbf{x}_B} = [1, -1, 2], yB=+1y_B = +1
  • C: xC=[1,1,1]{\xx \mathbf{x}_C} = [1, 1, 1], yC=1y_C = -1 (label noise: same features as A, opposite label)

your job:

  1. run one epoch and write down w{\ww \mathbf{w}} after A, after B, and after C
  2. compute the pocket weight: among the weights you visited, which one makes the fewest mistakes on A/B/C?
  3. compute the averaged weight: wˉ{\ww \bar{\mathbf{w}}} = average of the three post-step weights, and count its mistakes on A/B/C

run the epoch:

  • start w=[0,0,0]{\ww \mathbf{w}} = [0,0,0]
  • A is misclassified (score = 0 → -1), so update: ww+(+1)[1,1,1]=[1,1,1]{\ww \mathbf{w}} \leftarrow {\ww \mathbf{w}} + (+1)[1,1,1] = [1,1,1]
  • B is correct under [1,1,1][1,1,1] (score = 1+(1)+2=21 + (-1) + 2 = 2), so no update (still [1,1,1][1,1,1])
  • C is misclassified under [1,1,1][1,1,1] (score = 3 → +1 but y=1y=-1), so update: w[1,1,1]+(1)[1,1,1]=[0,0,0]{\ww \mathbf{w}} \leftarrow [1,1,1] + (-1)[1,1,1] = [0,0,0]

so the visited weights are:

  • after A: [1,1,1][1,1,1]
  • after B: [1,1,1][1,1,1]
  • after C: [0,0,0][0,0,0]

pocket: evaluate mistakes on A/B/C.

  • with [1,1,1][1,1,1]: A ✓, B ✓, C ✗ → 1 mistake
  • with [0,0,0][0,0,0]: A ✗, B ✗, C ✓ → 2 mistakes

so pocket keeps [1,1,1][1,1,1].

averaged: wˉ=[1,1,1]+[1,1,1]+[0,0,0]3=[2/3,2/3,2/3]{\ww \bar{\mathbf{w}}} = \frac{[1,1,1] + [1,1,1] + [0,0,0]}{3} = [2/3, 2/3, 2/3]. this predicts A ✓, B ✓, C ✗ → 1 mistake.

the moral: when the data is non-separable, the last iterate can be garbage. pocket/average gives you something stable to ship.

4Watch the Race (the proof, with numbers)
Question

the convergence proof made two promises: the mandate floor wukηγ{\ww \mathbf{w}}^\top\mathbf{u} \ge k\eta{\gm \gamma} and the budget ceiling w2k(ηR)2\|{\ww \mathbf{w}}\|^2 \le k(\eta {\rr R})^2. verify both, mistake by mistake, on a dataset small enough for a pencil.

  • A: xA=[1,2]{\xx \mathbf{x}_A} = [1, 2], yA=+1y_A = +1
  • B: xB=[1,2]{\xx \mathbf{x}_B} = [1, -2], yB=+1y_B = +1
  • C: xC=[1,2]{\xx \mathbf{x}_C} = [-1, 2], yC=1y_C = -1
  • D: xD=[1,2]{\xx \mathbf{x}_D} = [-1, -2], yD=1y_D = -1

(no bias input this time; the boundary through the origin works here.) take the golden direction u=[1,0]\mathbf{u} = [1, 0] as given.

your job:

  1. compute the margin γ=miniyi(uxi){\gm \gamma} = \min_i y_i(\mathbf{u}^\top {\xx \mathbf{x}_i}) and the radius R=maxixi{\rr R} = \max_i \|{\xx \mathbf{x}_i}\|, then the mistake bound (R/γ)2({\rr R}/{\gm \gamma})^2
  2. run the perceptron by hand: w=[0,0]{\ww \mathbf{w}} = [0,0], η=1\eta = 1, order A, B, C, D, ties predict 1-1. after every mistake kk, record w{\ww \mathbf{w}}, wu{\ww \mathbf{w}}^\top\mathbf{u}, and w2\|{\ww \mathbf{w}}\|^2
  3. check the chain at each kk: is wukηγ{\ww \mathbf{w}}^\top\mathbf{u} \ge k\eta{\gm \gamma}? is w2k(ηR)2\|{\ww \mathbf{w}}\|^2 \le k(\eta {\rr R})^2?
  4. at the second mistake, compute the middle term 2ηy(wx)2\eta \cdot y({\ww \mathbf{w}}^\top{\xx \mathbf{x}}) and verify the expansion w+ηyx2=w2+2ηy(wx)+η2x2\|{\ww \mathbf{w}} + \eta y{\xx \mathbf{x}}\|^2 = \|{\ww \mathbf{w}}\|^2 + 2\eta\, y({\ww \mathbf{w}}^\top{\xx \mathbf{x}}) + \eta^2\|{\xx \mathbf{x}}\|^2 against the direct computation

1. the constants. margins against u=[1,0]\mathbf{u} = [1,0]: A: (+1)(1)=1(+1)(1) = 1, B: (+1)(1)=1(+1)(1) = 1, C: (1)(1)=1(-1)(-1) = 1, D: (1)(1)=1(-1)(-1) = 1. so γ=1{\gm \gamma} = 1. every point has x=1+4=5\|{\xx \mathbf{x}}\| = \sqrt{1 + 4} = \sqrt{5}, so R=5{\rr R} = \sqrt{5}. bound: (R/γ)2=5({\rr R}/{\gm \gamma})^2 = 5. at most 5 mistakes, promised before training starts.

2-3. the race.

  • A: score =0= 0 → tie → predict 1-1, label +1+1. mistake (k=1k=1). w[0,0]+[1,2]=[1,2]{\ww \mathbf{w}} \leftarrow [0,0] + [1,2] = [1, 2].
    • mandate: wu=1kηγ=1{\ww \mathbf{w}}^\top\mathbf{u} = 1 \ge k\eta{\gm \gamma} = 1 ✓ (exactly on the floor)
    • budget: w2=5k(ηR)2=5\|{\ww \mathbf{w}}\|^2 = 5 \le k(\eta {\rr R})^2 = 5 ✓ (exactly on the ceiling)
  • B: score =14=3= 1 - 4 = -3 → predict 1-1, label +1+1. mistake (k=2k=2). w[1,2]+[1,2]=[2,0]{\ww \mathbf{w}} \leftarrow [1,2] + [1,-2] = [2, 0].
    • mandate: wu=22{\ww \mathbf{w}}^\top\mathbf{u} = 2 \ge 2 ✓ (on the floor again; both mistakes were minimum-margin points)
    • budget: w2=410\|{\ww \mathbf{w}}\|^2 = 4 \le 10
  • C: score =2= -2 → predict 1-1 = label ✓. D: score =2= -2 → ✓.
  • pass 2: A scores 22, B scores 22, C scores 2-2, D scores 2-2. all correct. converged after k=2k = 2 mistakes, comfortably under the promised 5.

4. the middle term, caught in the act. at mistake 2, the weights were [1,2][1,2] and the point was [1,2][1,-2] with y=+1y = +1: the middle term is 2ηy(wx)=2(3)=62\eta \cdot y({\ww \mathbf{w}}^\top{\xx \mathbf{x}}) = 2(-3) = -6. expansion: w2+middle+x2=56+5=4\|{\ww \mathbf{w}}\|^2 + \text{middle} + \|{\xx \mathbf{x}}\|^2 = 5 - 6 + 5 = 4. direct: [2,0]2=4\|[2,0]\|^2 = 4. ✓. the update added an arrow of squared length 5 and the weights got shorter. the budget holds because the machine only ever eats points it disagrees with.

BonusEscape the Number Line
Question

here's a 1D dataset:

xxlabel
0-1
1+1
2+1
3-1

the +1s are sandwiched between the -1s.

your job:

  1. explain why no single threshold w0+w1x=0w_0 + w_1 x = 0 can separate these points
  2. propose a feature map ϕ(x)\phi(x) that makes them separable (hint: think quadratic)
  3. find a weight vector w{\ww \mathbf{w}} that works, and verify it on all four points

a single threshold splits the number line into two half-lines: everything left of the cut, and everything right. but the +1s live in the middle. you need a feature that "closes" around the +1s: something that's big in the middle and small at the edges (or vice versa).

1. a linear boundary w0+w1x=0w_0 + w_1 x = 0 is a single cut point on the number line. one side is +1, the other is -1. but the +1 region here is an interval [1,2][1,2], not a half-line. you'd need two cuts. one threshold can't do it.

2. use ϕ(x)=(1,x,x2)\phi(x) = (1, x, x^2). the quadratic term lets the boundary curve.

3. try w=[1,3,1]{\ww \mathbf{w}} = [-1, 3, -1]. compute wϕ(x)=1+3xx2{\ww \mathbf{w}}^\top \phi(x) = -1 + 3x - x^2:

  • x=0x=0: 1+00=1-1 + 0 - 0 = -1 → -1 ✓
  • x=1x=1: 1+31=1-1 + 3 - 1 = 1 → +1 ✓
  • x=2x=2: 1+64=1-1 + 6 - 4 = 1 → +1 ✓
  • x=3x=3: 1+99=1-1 + 9 - 9 = -1 → -1 ✓

the quadratic x2+3x1-x^2 + 3x - 1 is a downward parabola: positive at x=1x=1 and x=2x=2, negative at the endpoints. the perceptron can't invent x2x^2 (you had to add it), but once you do, the problem is linearly separable in the lifted space.

Bonus 2Implement the Loop
Question

using the 5-bullet recipe from §1.8 as your spec, write the full training loop. (don't write code? skip to the hand-trace tables in the solution; every row is checkable arithmetic.) here's a small dataset to test it on:

  • A = (1,1,1)(1, 1, 1), y=+1y = +1
  • B = (1,2,1)(1, 2, 1), y=+1y = +1
  • C = (1,0.5,2)(1, 0.5, 2), y=1y = -1

(already includes bias.) run 3 full passes through the data with η=1\eta = 1, processing in order A, B, C each pass. report:

  1. final weights
  2. pocket weights (best seen)
  3. averaged weights

the recipe bullets are the spec. pocket = keep the weight vector with fewest mistakes on the full dataset. average = running sum of every weight vector seen, divided by total steps.

w = [0, 0, 0]
w_sum = [0, 0, 0]; t = 0
w_pocket = w; best_mistakes = count_mistakes(w, all_data)
 
for epoch in 1..N:          # one pass through all points
  # (shuffle in production; fixed order here for hand-tracing)
  mistakes = 0
  for (x, y) in data:
    score = dot(w, x)
    pred = +1 if score > 0 else -1
    if pred != y:
      w = w + eta * y * x
      mistakes += 1
      m = count_mistakes(w, all_data)   # pocket check; w only changed here
      if m < best_mistakes:
        w_pocket = w; best_mistakes = m
    w_sum += w; t += 1
 
  if mistakes == 0: break
 
w_avg = w_sum / t
return w, w_pocket, w_avg

hand-trace, pass 1 (w starts at [0,0,0][0,0,0]):

steppointwx{\ww \mathbf{w}}^\top {\xx \mathbf{x}}predlabelmistake?w{\ww \mathbf{w}} after
1A (1,1,1)(1,1,1), y=+1y=+1001-1+1+1yes[1,1,1][1,1,1]
2B (1,2,1)(1,2,1), y=+1y=+11+2+1=41+2+1=4+1+1+1+1no[1,1,1][1,1,1]
3C (1,0.5,2)(1,0.5,2), y=1y=-11+0.5+2=3.51+0.5+2=3.5+1+11-1yes[0,0.5,1][0, 0.5, -1]

after pass 1: w=[0,0.5,1]{\ww \mathbf{w}} = [0, 0.5, -1]. check all points:

  • A: 0+0.51=0.50 + 0.5 - 1 = -0.5 → pred 1-1, label +1+1. wrong
  • B: 0+11=00 + 1 - 1 = 0 → pred 1-1, label +1+1. wrong
  • C: 0+0.252=1.750 + 0.25 - 2 = -1.75 → pred 1-1, label 1-1. correct

the current weights make 2 mistakes, but the pocket isn't holding the current weights. it checks every weight vector the loop visits, and step 1's [1,1,1][1,1,1] makes only 1 mistake on the full dataset (A: 33 ✓, B: 44 ✓, C: 3.5+13.5 \to +1 ✗). nothing since has beaten that, so the pocket holds [1,1,1][1,1,1].

continue passes 2-3 the same way. your implementation should reproduce this trace for pass 1.

pass 2 (w starts at [0,0.5,1][0, 0.5, -1]):

steppointwx{\ww \mathbf{w}}^\top {\xx \mathbf{x}}predlabelmistake?w{\ww \mathbf{w}} after
4A0+0.51=0.50 + 0.5 - 1 = -0.51-1+1+1yes[1,1.5,0][1, 1.5, 0]
5B1+3+0=41 + 3 + 0 = 4+1+1+1+1no[1,1.5,0][1, 1.5, 0]
6C1+0.75+0=1.751 + 0.75 + 0 = 1.75+1+11-1yes[0,1,2][0, 1, -2]

end of pass 2: w=[0,1,2]{\ww \mathbf{w}} = [0, 1, -2]. neither visited weight beats the pocket's 1 mistake: step 4's [1,1.5,0][1, 1.5, 0] ties it (only C wrong, at 1.75+11.75 \to +1), and [0,1,2][0, 1, -2] makes 2 (A and B wrong). pocket stays [1,1,1][1,1,1].

pass 3 (w starts at [0,1,2][0, 1, -2]):

steppointwx{\ww \mathbf{w}}^\top {\xx \mathbf{x}}predlabelmistake?w{\ww \mathbf{w}} after
7A0+12=10 + 1 - 2 = -11-1+1+1yes[1,2,1][1, 2, -1]
8B1+41=41 + 4 - 1 = 4+1+1+1+1no[1,2,1][1, 2, -1]
9C1+12=01 + 1 - 2 = 01-11-1no[1,2,1][1, 2, -1]

end of pass 3: w=[1,2,1]{\ww \mathbf{w}} = [1, 2, -1]. check all: A =2= 2 ✓, B =4= 4 ✓, C =01= 0 \to -1 ✓. 0 mistakes. the pocket check at step 7 catches it the moment it's created: pocket updates to [1,2,1][1, 2, -1].

(note: point C lands exactly on the boundary with score 0. our code’s tie-breaker else -1 saves us here, but mathematically, you'd usually want a strictly non-zero score to claim true separation!)

final answers:

  1. final weights: [1,2,1][1, 2, -1]
  2. pocket weights: [1,2,1][1, 2, -1] (0 mistakes; converged, so pocket = final)
  3. averaged weights: wˉ=19t=19wt=[7/9,  12.5/9,  4/9][0.78,  1.39,  0.44]{\ww \bar{\mathbf{w}}} = \frac{1}{9}\sum_{t=1}^{9} {\ww \mathbf{w}_t} = [7/9,\; 12.5/9,\; -4/9] \approx [0.78,\; 1.39,\; -0.44]; gets A and B right but misclassifies C (1 mistake)