the setup (mars edition)
we're on mars. we have two biological castes of martians:
absolute units. dense.
built like refrigerators filled with other refrigerators.
gracile. hollow-boned.
built like a skittish sentient breadstick.
our job: build a machine that looks at a martian's measurements, then flashes the right light.
we have a clipboard. we start with two measurements:
- = height in cm
- = weight in kg
we write our measurements as a vector, which, for now, just means the measurements stacked in a bracket:
on the scatter plots below, i’ll put weight on the horizontal axis and height on the vertical, because it reads cleaner. in the algebra, just means “the measurements”; the order is arbitrary as long as uses the same order.
drawing the line
before we do any math, solve the problem visually.
we want a decision boundary: a line that separates the chonkers () from the skinny legends ().
try it yourself. drag the two handles to position the line so it separates the two groups.
now: what is that line, mathematically?
building the decision engine from scratch
how do we turn two numbers into one decision? we invent a score, then choose a cutoff: means chonk.
look back at the scatter you just dragged a line through. in this toy dataset, the chonkers mostly live to the right: heavier.
so a decent first bet is a score that rises when either measurement rises: move right, score goes up; move up, score goes up.
this score adds centimeters to kilograms like they're the same thing. we'll fix that. but it already collapses the clipboard to a single number we can threshold ("chonker if score > c"), and the threshold has a shape: the points where form a straight diagonal line. that's the thing you just drew, rediscovered as algebra.
two problems remain: cm and kg aren’t comparable, and sometimes one feature should count more than the other.
weights are exchange rates
if height should count double, you'd write:
that “2” is literally an exchange rate.
replace the hard-coded 1s with adjustable weights, and :
interpretation:
- one cm of height buys you score points
- one kg of weight buys you score points
this lets us put heterogeneous measurements on the same scoreboard.
bias: pick the cutoff
we need one more number: a cutoff . that’s the line in the sand.
- if → chonker
- if → skinny legend
- if → exactly on the boundary
what does that boundary look like? plot every point where . you get a straight line: that's what you were dragging around.
that straight line is just what "tied score" looks like in 2D. with three features it's a plane. with a hundred it's a hyperplane.
comparing to is the same as moving to the left and comparing to zero. define and write:
same boundary, cleaner sign check.
sanity check with real numbers
here's the equation you found:
test it on two martians:
-
H=175, W=78 (a chonker):
-
H=160, W=52 (a skinny legend):
the chonker should score above the cutoff and the skinny legend below it. if either verdict reads wrong, your line probably needs work: scroll up and drag until both come out right. (one caveat: the printed check rounds your coefficients, so a line that only barely squeaks past a point can round the wrong way. give the boundary some breathing room.) the machine has no judgment beyond the three numbers you picked.
three numbers → a machine that can tell them apart.
the general form
the weights control the angle of the boundary. the bias (or equivalently the cutoff ) shifts it.
it works for any number of features. add age, antennas? just add more weighted terms. it's always a flat cut through the space (a "hyperplane").
we can fold the bias into the weights to keep the algebra clean: treat it as a weight attached to a constant input . then:
which is a product of two vectors:
so if we define:
the whole machine is:
one vector of weights, one vector of inputs. multiply and sum. (that's all the raised means: tip the column of weights onto its side so it can pair up with the inputs. wherever you see , read "multiply matching entries, add them up.")
the geometry of the dot product
we have this equation . what does it look like?
ignore the bias for a second. so far a vector has been a column of numbers. plot those numbers as coordinates (first number along one axis, second along the other) and the same vector becomes an arrow from the origin to that point. one object, two costumes. the arrow's length gets written , and it's just pythagoras on the column: . that is all the double bars ever mean, here or anywhere below.
together, form a vector: an arrow pointing somewhere in height-weight space. that's where the machine is looking.
the score measures how well a point aligns with that direction:
where is the angle between the two arrows. if trig is a distant memory: is a dial for agreement. it reads 1 when the arrows point the same way, slides to 0 as they approach 90°, and hits −1 when they're opposed.
- if they point in roughly the same direction (), the score is positive.
- if they point in opposite directions (), the score is negative.
- if they are exactly perpendicular (), the score is zero.
two notes. the dot in is the same multiply-and-add as : two notations, one operation. and the equality between multiply-and-add and lengths-times-agreement-dial is the one fact in this essay we'll assert without deriving.
the viz computes the score both ways, live. drag (where the machine looks) and (a martian) anywhere: the two numbers refuse to disagree, the score flips sign the instant the angle crosses 90°, and it grows when either arrow gets longer.
and that last case is the boundary. means "perpendicular to ." every point perpendicular to the weight vector scores zero, and that set of points forms a line (in 2d), a plane (in 3d), a hyperplane (in general).
learning is just rotating the flashlight so that the beam illuminates the chonkers and leaves the skinny legends in the dark.
one loose end: a real objection, though if 3D slicing makes your head hurt, skip this paragraph freely; nothing below depends on it. the line you dragged in 1.2 didn't pass through the origin, but is a wall through the origin ( always scores zero, no matter what is). both are true, and the bias trick from 1.2 is what reconciles them. in the augmented space , the boundary really is a flat wall through the origin, perpendicular to the full three-component . but every actual martian lives on the slice , one unit out along the bias axis. slice a tilted wall-through-the-origin with that sheet and you get an offset line. the bias controls the wall's tilt, and the tilt controls where the line lands in your 2D plot. so "ignore the bias" gave you a flat shadow of the honest picture: one dimension up, the flashlight geometry is exact.
the weights are the classifier. change the weights, you rotate the boundary.
we start with random weights and watch the first mistake happen.
the incident (an actual example with actual numbers)
our machine boots up. the knobs are set randomly:
meaning:
- bias = 5 (starts chonk-leaning)
- height coefficient = 0.1 (taller → slightly more chonk-leaning)
- weight coefficient = -0.3 (heavier → MORE skinny-leaning??? backwards, but it's random)
a martian waddles in:
- ground truth: this is a CHONKER (+1)
- measurements: height = 170 cm, weight = 76 kg
the machine computes:
step by step:
- bias term:
- height term:
- weight term:
score is negative. machine predicts skinny legend (-1).
but this martian is a CHONKER.
false negative.
geometrically: , so this point is on the “negative” side of the boundary: behind the beam.
and it's not just our random-knob machine that makes mistakes. below is the line you found in 1.2, against the same crowd plus two new arrivals: a water-retaining chonker who measures skinny, and a dehydrated skinny legend who scans chonk. check the verdict under the plot. odds are your line misses someone too. mistakes happen even to good lines. the question is what to do after one.
learning from mistakes (the update)
our machine called a chonker a skinny legend. the beam is pointing the wrong way. we need to change the weights.
our incident chonker () got a score of -0.8. multiply them together and you get -0.8. when the prediction and label disagree, is always negative.
take one training example , where the label is either (chonker) or (skinny legend).
- score =
- prediction = , the score's sign: if positive, if negative
when the prediction and the label agree, is positive (two positives or two negatives). when they disagree, it's negative:
what change to makes go up the fastest?
suppose we add some :
the first term is the old mistake. the second term is the only knob we can turn.
the score is just a simple sum of multiplications. if you nudge one weight by some amount , the overall score shifts by exactly that amount multiplied by the input .
so the example itself tells you where the leverage is. for our incident :
- add 1 to → score changes by 1
- add 1 to → score changes by 170
- add 1 to → score changes by 76
so we want to have a big positive dot product with .
if you hold the step size fixed (say, ), the dot product is maximized at , when points in the same direction as . just like the flashlight beam in 1.3, we want to physically rotate the weights to point exactly at the missed example.
point directly at , scaled by a positive number that you choose (the learning rate: how big a step to take).
when you add ; when you subtract . same rule.
that’s the update: when you’re wrong, move toward the labeled point.
in pseudocode:
initialize w (zeros are fine)
repeat for epochs: # one epoch = one pass through all points
for (x, y) in data:
pred = +1 if dot(w, x) > 0 else -1 # tie (score exactly 0) breaks to -1
if pred != y: # mistake
w = w + η * y * xthe tie-break is load-bearing. start from and every score is exactly zero. if a tied score counted as correct, the loop would never fire a single update; the machine would sit at zero forever, technically never wrong by its own generous accounting. calling score 0 a means the first chonker who walks in registers as a mistake and gets the weights moving.
here's one update, before and after: the boundary rotates toward the missed point.
why this helps (the one-step guarantee)
the disagreement detector from 1.5 deserves a name. the signed margin is the score times the label:
- : correct side
- : wrong side (update fires)
- : on the fence; the tie-break from 1.5 calls it , so it's a mistake exactly when the martian is a chonker
either way, an update only ever fires when . file that fact away; it carries the whole proof in 1.7.
one small gear before the algebra: a vector dotted with itself is its own squared length, . multiply matching entries with themselves and you've built pythagoras's sum of squares.
after an update :
since and for any real input, the signed margin strictly increases on that example.
step through it yourself below. each press visits one point: correct ones pass quietly, wrong ones fire the update and the boundary lurches. nothing protects the other points while it lurches.
one update makes you less wrong on that point. it doesn't promise the whole dataset stops.
can the dataset keep forcing updates forever?
the inevitable crash (why it stops)
if the data is separable, the perceptron runs out of mistakes. the proof tracks two quantities on a collision course.
(in 1.4 we booted up with random knobs because it makes a better story. the proof starts from zero because it makes cleaner algebra; a nonzero start survives the same argument, it just tacks a constant onto each bound and shifts the final count.)
every update drags toward the golden direction. the projection onto (the shadow casts along that direction) grows linearly with mistakes :
but the length of grows at most like :
(square root both sides: .)
the mandate is almost the definition of margin. each update adds , and for every training point; that's what "classifies everything with margin " means. so each mistake pushes up by at least , and starting from zero, mistakes force .
the budget is the claim that should bother you. vector lengths don't behave this politely in general: add aligned vectors and the length grows linearly, not like . expand what one update does to the squared length:
(that expansion is school's , with dot products standing in for multiplication.)
stare at the middle term. it's the signed margin from 1.6, scaled by . and updates only fire on mistakes, when . the middle term is never positive at the moment we add to . the trigger that causes the growth is the same condition that caps it: the perceptron only ever lengthens its weights when the weights were pointing the wrong way, and adding a vector you disagree with mostly cancels.
so each mistake adds at most of squared length. after mistakes, starting from zero: .
watch it happen with numbers. say meets with label and : the score is , a mistake, so we add . the expansion predicts the new squared length: . check it directly: , squared length . we added an arrow of squared length 5 and the weights got shorter. that's the middle term doing its job.
the same fact, draggable. the shaded half is where mistakes live: every point there scores negative against , so it is the only region the update rule ever adds from. drag around both halves and keep your eye on the middle term. in the shaded half it is never positive, and stays under the naive sum. in the unshaded half the middle term turns positive and the lengths would stack, but the perceptron never adds from there.
the race is plotted below. and set the slopes (the sliders wear the same colors); drag forward and watch the linear mandate close in on the square-root ceiling.
- the mandate (dot product) demands linear growth ().
- the budget (length) only allows square-root growth ().
but (the formal name for this ceiling is the cauchy-schwarz inequality, but geometrically it just means what we saw in 1.3: maxes out at 1). a vector's projection onto a unit vector (its shadow along that direction) can never be longer than the vector itself.
so:
a linear function of will always eventually overtake a square root of . but a projection can never outgrow the vector itself. so the only escape is that stops growing: the algorithm must run out of mistakes.
the collision happens when the floor meets the ceiling:
divide by :
square both sides:
the whole proof fits in one breath:
read it left to right: the left end rises like , the right end rises like , and the chain has to hold after every single mistake. a straight line cannot stay under a square root forever. so something must give, and the only thing that can give is . squeeze the ends together and out falls .
look at what survived the algebra. the learning rate didn't: it divided out and never came back. with zero starting weights, every weight vector the perceptron visits is a scaled copy of the run, and scaling never flips the sign of a score, so changes nothing about which mistakes happen. the knob everyone tunes first is, for this algorithm, decorative. the dimension count never even appeared: the bound sees the data only through and . two measurements per martian or two million: same geometry, same budget.
the bound is a number you can compute from the data before training starts. the loop can keep running forever, but the updates run out: one more mistake would push the mandate through the cauchy-schwarz ceiling, and that inequality doesn't bend. the machine stops misclassifying martians. not "probably won't keep failing." cannot.
now watch it happen. the trainer below starts from zero weights. before you press run, make a prediction: does the boundary glide into place, or thrash before it settles?
run the learn mission to zero mistakes. every lurch of the boundary is one update, and the theorem you just read is the reason the lurching ends. then run fail. that dataset has one flipped label, so no golden exists, there's no , the mandate evaporates, and the thrashing never has to stop. that's when we reach for the pocket algorithm (keep the best weights seen so far) or switch to a soft-margin classifier like an SVM.
exercise 4 has you run the race yourself: four martians, a pencil, both bounds checked at every mistake.
making it work
the convergence proof says the perceptron must stop. it doesn't say it stops gracefully.
remember our incident martian from 1.4? . with , the margin jump on each update is . one mistake and the weights are in orbit.
why so violent? because the inputs are raw units. against a bias input of 1, the measurements are a hundred times louder, and the squared length stacks the squares: height alone contributes of the 34,677. the convergence bound is , and it's the ratio that matters: multiply every coordinate by 10 and inflates right alongside , changing nothing. the damage comes from features that are loud relative to each other: one booming coordinate inflates the radius while the boundary still has to be negotiated by the quiet ones, so balloons.
the fix is boring and universal: standardize features. subtract each feature's average and divide by its typical spread (the standard deviation), , so no single dimension dominates. after standardization, our martian's coordinates are order-1 instead of order-100, the updates stay sane, and the convergence bound tightens.
you can feel this one too: the trainer in 1.7 has a scale mission. run it raw, open the advanced panel, and watch in the telemetry: same , violent steps. flip to standardized and the steps shrink to something civilized.
the recipe
- standardize features: (save ).
- add bias: include so the boundary can shift.
- shuffle each epoch: same points, fewer dumb cycles.
- early stop: stop when mistakes hit 0 (separable) or validation stops improving (non-separable).
- average or pocket: when in doubt, ship or the best seen.
what if the data can't be separated by any line?
escaping flatland (xor + why neural nets exist)
four points at the corners of a square. label them by diagonal: one diagonal pair is +1, the other is -1.
| A | B | output |
|---|---|---|
| 0 | 0 | -1 |
| 0 | 1 | +1 |
| 1 | 0 | +1 |
| 1 | 1 | -1 |
try to draw a line that separates them. really, drag it around.
you can't. sweep the boundary through every angle: at least one side always ends up mixed, and the counter never reaches zero. (it's even being generous: it scores whichever side-labeling flatters your line.) the perceptron will oscillate forever, shoving the boundary back and forth, never settling.
two lines prove it. both diagonals share the same midpoint: the center of the square. and a half-plane has no dents (the formal word is convex): if one side of a line holds a segment's two endpoints, it holds everything between them. so the side holding both +1s would own the center, and the side holding both -1s would own the center too. one point, claimed by both sides of a line. no.
the score sees each input's contribution independently: it knows "A is big" and "B is big" as separate facts.
what it can't see is whether A and B agree. agreement lives between the inputs, and a weighted sum has no term for between: every input contributes on its own, blind to the others.
there are two ways out:
- change the representation (feature engineering)
- learn the representation (a neural network)
the feature-map escape hatch (make xor separable)
if the flashlight can't see the pattern, give it a new dimension to look along: one that turns "agreement," the exact thing the weighted sum is blind to, into a raw number we can hand over as an input.
look at the inputs. they are 0s and 1s. we need to compute something new from them. and "new" is a real constraint, because most of the obvious candidates aren't:
- is a weighted sum. the machine builds weighted sums: hand it as a feature and you've given it a knob it could already turn by setting . no new direction. same flatland, extra steps.
- , , any recipe of the form : same problem. they're all inside the flashlight's reach already.
the new input has to be something no weighted sum can fake. multiplication is the cheapest such thing: is 1 exactly when both inputs are on, and no reproduces that table: matching the three 0s forces , which then can't produce the 1.
by manually computing and feeding it in alongside the original inputs, we give the machine a dedicated knob for that relationship. geometrically, the new coordinate lifts the (1,1) corner out of the plane, and now a flat sheet of glass can separate the classes. the sheet has to tilt, angled so the lifted (1,1) and the ground-floor (0,0) both end up above it while the two +1 corners sit below. drag the slider and watch the corner climb until the glass can finally get between the classes.
this preprocessing step is called a feature map:
(four coordinates once you count the bias; the picture above shows the three that move.)
the mapped points are:
| point | label | |
|---|---|---|
| (0,0) | -1 | |
| (0,1) | +1 | |
| (1,0) | +1 | |
| (1,1) | -1 |
one separating weight vector is . check:
- : → -1 ✓
- : → +1 ✓
- : → +1 ✓
- : → -1 ✓
translation: “if BOTH inputs are 1, subtract 4 from the score.” that feature makes XOR separable.
but this is the catch: you had to know to add . the perceptron can't see features it wasn't given: was in the beam's blind spot. it can only learn weights for the dimensions you hand it.
that term is the simplest interaction feature: it turns “both are on” into a single coordinate.
in the martian story, is the same move. continuous measurements cost it the clean on/off behavior of , but the knob still depends on both measurements jointly, which no weighted sum of and alone can express.
the question for next time
the feature map worked, but only because we knew was the right feature.
for XOR that's trivial: two inputs, one pairwise interaction to try. for our martians, we could guess . but for a real problem with a hundred measurements, there are possible pairwise interactions. three-way combinations? . you're not hand-crafting your way out of that.
the perceptron can learn any boundary in the space you give it. but it can't see past that space. if the feature that matters is a combination you didn't think to include, no amount of weight-nudging will find it. the flashlight can only illuminate dimensions that exist.
what if the machine could build its own dimensions?
that's the idea behind a neural network. instead of you designing and hoping, you give the machine a stack of layers. each layer computes weighted sums (exactly like a perceptron), then bends the result through a nonlinearity (like clamping negatives to zero). the bend is load-bearing: stack linear layers without it and the tower collapses, because a weighted sum of weighted sums is just one big weighted sum, and we spent this whole section proving what a single one can't do. with the bend, the early layers combine raw inputs into new features, and the later layers combine those into higher-order ones.
nobody tells the network about . training rewards whatever accidentally behaves like , so that behavior grows.
the final layer is still a dot product. everything we built in this essay (the score, the boundary, the sign check) is still there at the output. a neural network is a perceptron whose inputs are learned features rather than raw measurements.
but stacking layers creates a problem the perceptron never had.
in a single perceptron, when the machine miscalls a chonker, we know exactly who to blame: the one weight vector . we nudge it toward and move on. one knob, one gradient, one step.
in a network with three layers and thousands of weights, the output depends on all of them. when the prediction is wrong, the blame is smeared: maybe an early layer built a bad feature, maybe the last layer drew a bad boundary through good features, most likely some of both. every weight had a hand in the output, and each one deserves a different share of the blame.
credit assignment: which weight to blame, by how much. that's the next essay.
tl;dr (for the adhd crew)
- the machine: weighted sum of inputs → is it positive?
- the problem: we don't know the right weights
- the fix: when wrong,
- why it helps: margin increases by on the mistake example
- the catch: if a perfect line with breathing room exists, the total number of mistakes is capped at , where is the length of the longest input vector. if no such line exists, the updates can thrash forever
- the deeper catch: if your representation is wrong (xor), no amount of linear sweat will save you
- the open question: neural nets learn their own features, but when the output is wrong, which weights do you blame? (next essay: backpropagation)
exercises (pencil + paper)
the goal: you should be able to do these with a pencil, and you should know you're right at the end.
- start with (bias, height, weight)
- one martian: , true label ; measurements already standardized per the §1.8 recipe (nobody here is four centimeters tall)
- learning rate:
do this by hand:
- compute and the predicted label
- if it's wrong, apply the update
- recompute the score with the new weights
score = , so prediction is -1 (wrong). update:
new score = (positive), so it flips to +1. you can also verify the margin jump:
consider two points with a single feature :
- ,
- ,
now do two mini-cases:
- no bias term: you are forced to use (boundary must pass through 0)
- with bias: you can use
with no bias, the boundary is , so . both points have positive x, so they always land on the same side and can't be separated.
with bias, pick any threshold between 2 and 5, e.g. boundary . one valid choice:
then (A is +1) and (B is -1).
this dataset is not separable on purpose: two examples share identical features but disagree on the label.
use a 2d input with bias: .
training set (in this exact order), with and starting weights :
- A: ,
- B: ,
- C: , (label noise: same features as A, opposite label)
your job:
- run one epoch and write down after A, after B, and after C
- compute the pocket weight: among the weights you visited, which one makes the fewest mistakes on A/B/C?
- compute the averaged weight: = average of the three post-step weights, and count its mistakes on A/B/C
run the epoch:
- start
- A is misclassified (score = 0 → -1), so update:
- B is correct under (score = ), so no update (still )
- C is misclassified under (score = 3 → +1 but ), so update:
so the visited weights are:
- after A:
- after B:
- after C:
pocket: evaluate mistakes on A/B/C.
- with : A ✓, B ✓, C ✗ → 1 mistake
- with : A ✗, B ✗, C ✓ → 2 mistakes
so pocket keeps .
averaged: . this predicts A ✓, B ✓, C ✗ → 1 mistake.
the moral: when the data is non-separable, the last iterate can be garbage. pocket/average gives you something stable to ship.
the convergence proof made two promises: the mandate floor and the budget ceiling . verify both, mistake by mistake, on a dataset small enough for a pencil.
- A: ,
- B: ,
- C: ,
- D: ,
(no bias input this time; the boundary through the origin works here.) take the golden direction as given.
your job:
- compute the margin and the radius , then the mistake bound
- run the perceptron by hand: , , order A, B, C, D, ties predict . after every mistake , record , , and
- check the chain at each : is ? is ?
- at the second mistake, compute the middle term and verify the expansion against the direct computation
1. the constants. margins against : A: , B: , C: , D: . so . every point has , so . bound: . at most 5 mistakes, promised before training starts.
2-3. the race.
- A: score → tie → predict , label . mistake (). .
- mandate: ✓ (exactly on the floor)
- budget: ✓ (exactly on the ceiling)
- B: score → predict , label . mistake (). .
- mandate: ✓ (on the floor again; both mistakes were minimum-margin points)
- budget: ✓
- C: score → predict = label ✓. D: score → ✓.
- pass 2: A scores , B scores , C scores , D scores . all correct. converged after mistakes, comfortably under the promised 5.
4. the middle term, caught in the act. at mistake 2, the weights were and the point was with : the middle term is . expansion: . direct: . ✓. the update added an arrow of squared length 5 and the weights got shorter. the budget holds because the machine only ever eats points it disagrees with.
here's a 1D dataset:
| label | |
|---|---|
| 0 | -1 |
| 1 | +1 |
| 2 | +1 |
| 3 | -1 |
the +1s are sandwiched between the -1s.
your job:
- explain why no single threshold can separate these points
- propose a feature map that makes them separable (hint: think quadratic)
- find a weight vector that works, and verify it on all four points
a single threshold splits the number line into two half-lines: everything left of the cut, and everything right. but the +1s live in the middle. you need a feature that "closes" around the +1s: something that's big in the middle and small at the edges (or vice versa).
1. a linear boundary is a single cut point on the number line. one side is +1, the other is -1. but the +1 region here is an interval , not a half-line. you'd need two cuts. one threshold can't do it.
2. use . the quadratic term lets the boundary curve.
3. try . compute :
- : → -1 ✓
- : → +1 ✓
- : → +1 ✓
- : → -1 ✓
the quadratic is a downward parabola: positive at and , negative at the endpoints. the perceptron can't invent (you had to add it), but once you do, the problem is linearly separable in the lifted space.
using the 5-bullet recipe from §1.8 as your spec, write the full training loop. (don't write code? skip to the hand-trace tables in the solution; every row is checkable arithmetic.) here's a small dataset to test it on:
- A = ,
- B = ,
- C = ,
(already includes bias.) run 3 full passes through the data with , processing in order A, B, C each pass. report:
- final weights
- pocket weights (best seen)
- averaged weights
the recipe bullets are the spec. pocket = keep the weight vector with fewest mistakes on the full dataset. average = running sum of every weight vector seen, divided by total steps.
w = [0, 0, 0]
w_sum = [0, 0, 0]; t = 0
w_pocket = w; best_mistakes = count_mistakes(w, all_data)
for epoch in 1..N: # one pass through all points
# (shuffle in production; fixed order here for hand-tracing)
mistakes = 0
for (x, y) in data:
score = dot(w, x)
pred = +1 if score > 0 else -1
if pred != y:
w = w + eta * y * x
mistakes += 1
m = count_mistakes(w, all_data) # pocket check; w only changed here
if m < best_mistakes:
w_pocket = w; best_mistakes = m
w_sum += w; t += 1
if mistakes == 0: break
w_avg = w_sum / t
return w, w_pocket, w_avghand-trace, pass 1 (w starts at ):
| step | point | pred | label | mistake? | after | |
|---|---|---|---|---|---|---|
| 1 | A , | yes | ||||
| 2 | B , | no | ||||
| 3 | C , | yes |
after pass 1: . check all points:
- A: → pred , label . wrong
- B: → pred , label . wrong
- C: → pred , label . correct
the current weights make 2 mistakes, but the pocket isn't holding the current weights. it checks every weight vector the loop visits, and step 1's makes only 1 mistake on the full dataset (A: ✓, B: ✓, C: ✗). nothing since has beaten that, so the pocket holds .
continue passes 2-3 the same way. your implementation should reproduce this trace for pass 1.
pass 2 (w starts at ):
| step | point | pred | label | mistake? | after | |
|---|---|---|---|---|---|---|
| 4 | A | yes | ||||
| 5 | B | no | ||||
| 6 | C | yes |
end of pass 2: . neither visited weight beats the pocket's 1 mistake: step 4's ties it (only C wrong, at ), and makes 2 (A and B wrong). pocket stays .
pass 3 (w starts at ):
| step | point | pred | label | mistake? | after | |
|---|---|---|---|---|---|---|
| 7 | A | yes | ||||
| 8 | B | no | ||||
| 9 | C | no |
end of pass 3: . check all: A ✓, B ✓, C ✓. 0 mistakes. the pocket check at step 7 catches it the moment it's created: pocket updates to .
(note: point C lands exactly on the boundary with score 0. our code’s tie-breaker else -1 saves us here, but mathematically, you'd usually want a strictly non-zero score to claim true separation!)
final answers:
- final weights:
- pocket weights: (0 mistakes; converged, so pocket = final)
- averaged weights: ; gets A and B right but misclassifies C (1 mistake)