january 11, 2026

the perceptron: no thoughts, just dot products

Andrew R. Louis

It’s just y = mx + b with a God complex.

1.1

the setup (mars edition)

we're on mars. we have two biological castes of martians:

🔵 chonkers (+1)

absolute units. dense.

built like refrigerators filled with other refrigerators.

🔴 skinny legends (-1)

gracile. hollow-boned.

built like a skittish sentient breadstick.

our job: build a machine that looks at a martian's measurements, then flashes the right light.

we have a clipboard. we start with two measurements:

  • H = height in cm
  • W = weight in kg

we write our measurements as a vector:

\mathbf{x} = \begin{bmatrix} H \\ W \end{bmatrix}

on the scatter plots below, i’ll put weight on the horizontal axis and height on the vertical, because it reads cleaner. in the algebra, \mathbf{x} just means “the measurements” — the order is arbitrary as long as \mathbf{w} uses the same order.

1.2

drawing the line

before we do any math, solve the problem visually.

we want a decision boundary—a line that separates the chonkers (+1) from the skinny legends (-1).

try it yourself: position a line so it separates the two groups.

now: what is that line, mathematically?

building the decision engine from scratch

how do we turn two numbers into one decision? we invent a score, then choose a cutoff: \text{score} > c means chonk.

look at the scatter. in this toy dataset, the chonkers mostly live to the right: heavier.

so a decent first bet is a score that rises when either measurement rises—move right, score goes up; move up, score goes up.

\text{score} = H + W

this score ignores units (cm vs kg). but it buys us two crucial things immediately:

  1. a single scalar we can threshold ("chonker if score > c")
  2. a clean geometry: the decision boundary \text{score} = c becomes a straight line (a diagonal cut between bottom-left and top-right)

now we fix the problems: cm and kg aren’t comparable, and sometimes one feature should count more than the other.

weights are exchange rates

if height should count double, you'd write:

\text{score} = 2H + W

that “2” is literally an exchange rate.

replace the hard-coded 1s with adjustable weights, w_1 and w_2:

\text{score} = w_1 H + w_2 W

interpretation:

  • one cm of height buys you w_1 score points
  • one kg of weight buys you w_2 score points

this lets us put heterogeneous measurements on the same scoreboard.

bias: pick the cutoff

we need one more number: a cutoff c. that’s the line in the sand.

  • if \text{score} > c → chonker
  • if \text{score} < c → skinny legend
  • if \text{score} = c → exactly on the boundary

what does that boundary look like? plot every point where \text{score} = c. you get a straight line—the one you were just positioning.

that straight line is just what "tied score" looks like in 2D. with three features it's a plane. with a hundred it's a hyperplane.

comparing to c is the same as moving c to the left-hand side and comparing to zero. define b = -c and write:

\text{score} + b > 0

same boundary, cleaner sign check.

sanity check with real numbers

take the equation you found and test it on two martians:

  • H=175, W=78 (chonker): score > cutoff → predicted chonker (+1).

  • H=160, W=52 (skinny legend): score < cutoff → predicted skinny legend (-1).

three numbers → a machine that can tell them apart.
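the same sanity check in code. the weights and cutoff here (w_1 = w_2 = 1, c = 220) are an assumption for illustration, not the line you found — any line that puts the two martians on opposite sides behaves the same way:

```python
# hypothetical weights and cutoff, chosen only so the two example martians
# land on opposite sides: score = w1*H + w2*W, then compare to the cutoff c.
def classify(H, W, w1=1.0, w2=1.0, c=220.0):
    score = w1 * H + w2 * W
    return +1 if score > c else -1   # chonker if score > cutoff

print(classify(175, 78))  # 253 > 220, prints 1  (chonker)
print(classify(160, 52))  # 212 < 220, prints -1 (skinny legend)
```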

the general form

the weights w_1, w_2 control the angle of the boundary. the bias b (or equivalently the cutoff c = -b) shifts it.

it works for any number of features. add age, antennas? just add more weighted terms. it's always a flat cut through the space (a "hyperplane").

we can fold the bias into the weights to keep the algebra clean: treat it as a weight w_0 attached to a constant input x_0 = 1. then:

w_0 \cdot 1 + w_1 H + w_2 W = 0

which is a product of two vectors:

\begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} 1 \\ H \\ W \end{bmatrix} = 0

so if we define:

\mathbf{x} = \begin{bmatrix} 1 \\ H \\ W \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}

the whole machine is:

\mathbf{w}^\top \mathbf{x} = 0

one vector of weights, one vector of inputs. multiply and sum.

1.3

the geometry of the dot product

we have this equation \mathbf{w}^\top \mathbf{x} = 0. what does it look like?

ignore the bias for a second. together, (w_1, w_2) form a vector — a direction in height-weight space. that's where the machine is looking.

the score measures how well a point \mathbf{x} aligns with that direction:

\text{score} = \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos \theta

where \theta is the angle between them.

  • if they point in roughly the same direction (\theta < 90^\circ), the score is positive.
  • if they point in opposite directions (\theta > 90^\circ), the score is negative.
  • if they are exactly perpendicular (\theta = 90^\circ), the score is zero.

and that last case is the boundary. \text{score} = 0 means "perpendicular to \mathbf{w}." every point perpendicular to the weight vector scores zero, and that set of points forms a line (in 2d), a plane (in 3d), a hyperplane (in general).
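a quick numerical sketch of that picture (the weight vector (2, 1) is an arbitrary choice for the demo): every multiple of the direction perpendicular to \mathbf{w} scores zero, and points on either side get opposite signs.

```python
import math

w = (2.0, 1.0)         # arbitrary weight vector (no bias term)
perp = (-w[1], w[0])   # w rotated 90 degrees: every multiple lies on the boundary

def score(x):
    return w[0] * x[0] + w[1] * x[1]

# perpendicular to w -> score is (numerically) zero, however far out you go
for t in (-3.0, 0.5, 2.0):
    assert math.isclose(score((t * perp[0], t * perp[1])), 0.0, abs_tol=1e-9)

print(score((1.0, 1.0)))    # roughly aligned with w -> positive (3.0)
print(score((-1.0, -1.0)))  # opposed to w -> negative (-3.0)
```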

learning is just rotating the flashlight so that the beam illuminates the chonkers and leaves the skinny legends in the dark.

the weights are the classifier. change the weights, you rotate the boundary.

we start with random weights and watch the first mistake happen.

1.4

the incident (an actual example with actual numbers)

our machine boots up. the knobs are set randomly:

\mathbf{w} = \begin{bmatrix} 5 \\ 0.1 \\ -0.3 \end{bmatrix}

meaning:

  • bias = 5 (starts chonk-leaning)
  • height coefficient = 0.1 (taller → slightly more chonk-leaning)
  • weight coefficient = -0.3 (heavier → MORE skinny-leaning??? backwards, but it's random)

a martian waddles in:

  • ground truth: this is a CHONKER (+1)
  • measurements: height = 170 cm, weight = 76 kg

\mathbf{x} = \begin{bmatrix} 1 \\ 170 \\ 76 \end{bmatrix}

the machine computes:

\text{score} = 5 \cdot 1 + 0.1 \cdot 170 + (-0.3) \cdot 76

step by step:

  • bias term: 5 \times 1 = 5
  • height term: 0.1 \times 170 = 17
  • weight term: -0.3 \times 76 = -22.8

\text{score} = 5 + 17 - 22.8 = \mathbf{-0.8}

score is negative. machine predicts skinny legend (-1).

but this martian is a CHONKER.

false negative.

geometrically: \mathbf{w}^\top\mathbf{x} < 0, so this point is on the “negative” side of the boundary — behind the beam.
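the incident, replayed in code (same weights and measurements as above):

```python
# w = [bias, height coefficient, weight coefficient], x = [1, H, W]
w = [5.0, 0.1, -0.3]
x = [1.0, 170.0, 76.0]

score = sum(wi * xi for wi, xi in zip(w, x))
prediction = +1 if score > 0 else -1

print(round(score, 6))  # -0.8  (5 + 17 - 22.8, up to float rounding)
print(prediction)       # -1: predicts skinny legend — a false negative
```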

1.5

learning from mistakes (the update)

our machine called a chonker a skinny legend. the beam is pointing the wrong way. we need to change the weights.

our incident chonker (y=+1) got a score of -0.8. multiply them together and you get -0.8. when the prediction and label disagree, y \cdot \text{score} is always negative.

take one training example (\mathbf{x}, y) where y \in \{-1, +1\}.

  • score = \mathbf{w}^\top \mathbf{x}
  • prediction = \text{sign}(\text{score})

when the prediction and the label agree, y \cdot \text{score} is positive (two positives or two negatives). when they disagree, it's negative:

y(\mathbf{w}^\top \mathbf{x}) < 0

what change to \mathbf{w} makes y(\mathbf{w}^\top\mathbf{x}) go up the fastest?

suppose we add some \Delta\mathbf{w}:

y((\mathbf{w} + \Delta\mathbf{w})^\top \mathbf{x}) = y(\mathbf{w}^\top \mathbf{x}) + y(\Delta\mathbf{w}^\top \mathbf{x})

the first term is the old mistake. the second term is the only knob we can turn.

the score is just a simple sum of multiplications. if you nudge one weight by some amount \Delta w_i, the overall score shifts by exactly that amount multiplied by the input x_i.

so the example itself tells you where the leverage is. for our incident \mathbf{x} = [1, 170, 76]:

  • add 1 to w_0 → score changes by 1
  • add 1 to w_1 → score changes by 170
  • add 1 to w_2 → score changes by 76

so we want \Delta\mathbf{w} to have a big positive dot product with y\mathbf{x}.

if you hold the step size fixed (say, \|\Delta\mathbf{w}\| = 1), the dot product \Delta\mathbf{w}^\top (y\mathbf{x}) = \|\Delta\mathbf{w}\|\,\|\mathbf{x}\|\cos\theta is maximized at \theta = 0 — when \Delta\mathbf{w} points in the same direction as y\mathbf{x}. just like the flashlight beam in 1.3, we want to physically rotate the weights to point exactly at the missed example.

point \Delta\mathbf{w} directly at y\mathbf{x}, scaled by \eta (the learning rate).

\mathbf{w} \leftarrow \mathbf{w} + \eta \cdot y \cdot \mathbf{x}
the perceptron update (error-driven)

when y=+1 you add \mathbf{x}; when y=-1 you subtract \mathbf{x}. same rule.

that’s the update: when you’re wrong, move toward the labeled point.

in pseudocode:

initialize w
repeat for epochs:           # one epoch = one pass through all points
  for (x, y) in data:
    if y * dot(w, x) <= 0:   # mistake (a score of exactly 0 counts as a miss)
      w = w + η * y * x
1.6

why this helps (the one-step guarantee)

without this trick, we'd need two cases: "if y=+1, check that the score is positive; if y=-1, check that it's negative." multiplying by y collapses both into one number — the signed margin — that should always be positive:

m = y \cdot (\mathbf{w}^\top \mathbf{x})

  • m > 0: correct side
  • m < 0: wrong side (update fires)

after an update \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}:

\begin{aligned} m_{new} &= y((\mathbf{w} + \eta y \mathbf{x})^\top \mathbf{x}) \\ &= y(\mathbf{w}^\top \mathbf{x}) + \eta \cdot y^2 \cdot \|\mathbf{x}\|^2 \\ &= m_{old} + \eta \|\mathbf{x}\|^2 \quad \text{(since } y^2 \text{ is always exactly 1)} \end{aligned}
signed-margin improvement on that example

since \eta > 0 and \|\mathbf{x}\|^2 > 0 for any real input, the signed margin strictly increases on that example.

one update makes you less wrong on that point. it doesn't promise the whole dataset stops.
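the one-step guarantee, checked numerically on the incident martian from 1.4 (with \eta = 1):

```python
eta = 1.0
w = [5.0, 0.1, -0.3]      # the random weights from 1.4
x = [1.0, 170.0, 76.0]    # the incident chonker
y = +1

def margin(w):
    return y * sum(wi * xi for wi, xi in zip(w, x))

m_old = margin(w)                                    # about -0.8: wrong side
w_new = [wi + eta * y * xi for wi, xi in zip(w, x)]  # one perceptron update
m_new = margin(w_new)

# the derivation predicts a jump of exactly eta * ||x||^2 = 1 + 170^2 + 76^2
print(round(m_new - m_old, 6))  # 34677.0
```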

can the dataset keep forcing updates forever?

1.7

the inevitable crash (why it stops)

if the data is separable, the perceptron cannot loop forever. our machine sorting martians must eventually stop misclassifying them.

not "probably won't." cannot.

we can prove this by tracking two physical quantities that are on a collision course. first, name the assumptions: "separable" means some unit vector \mathbf{u} and margin \gamma > 0 exist with y(\mathbf{u}^\top \mathbf{x}) \ge \gamma for every example; R is the largest \|\mathbf{x}\| in the dataset; and \mathbf{w} starts at \mathbf{0}.

the mandate (must grow fast)

the update adds \eta y\mathbf{x}, which always has a positive component along \mathbf{u} (at least \eta\gamma, by definition of the margin). so after k mistakes:

\mathbf{w}^\top \mathbf{u} \ge k \cdot (\eta \gamma)

the projection grows linearly with mistakes k.

the budget (can only grow slow)

every mistake adds at most a bounded amount of "energy" (squared length) to \mathbf{w}: at most (\eta R)^2 per update, because an update only fires when the score is on the wrong side, so the cross term 2\eta \cdot y(\mathbf{w}^\top \mathbf{x}) \le 0 can't help.

\|\mathbf{w}\|^2 \le k \cdot (\eta R)^2

the length grows at most like \sqrt{k}.


  • the mandate (dot product) demands linear growth (k).
  • the budget (length) only allows square-root growth (\sqrt{k}).

but \mathbf{w}^\top \mathbf{u} \le \|\mathbf{w}\|\,\|\mathbf{u}\| = \|\mathbf{w}\| (the formal name for this ceiling is the cauchy-schwarz inequality, but geometrically it just means what we saw in 1.3: \cos \theta maxes out at 1). a vector's projection onto a unit vector can never exceed the vector's own length.

\mathbf{w}^\top \mathbf{u} \le \|\mathbf{w}\|

so:

k \cdot (\eta \gamma) \le \sqrt{k} \cdot (\eta R)

a linear function of k will always eventually overtake a square root of k. but a projection can never outgrow the vector itself. so the only escape is that k stops growing: the algorithm must run out of mistakes.

let's pin down the number. the "collision" happens when the lower bound hits the upper bound:

k \eta \gamma \le \sqrt{k} \eta R

divide by \eta \sqrt{k}:

\sqrt{k}\, \gamma \le R

square both sides:

k \gamma^2 \le R^2 \implies k \le \frac{R^2}{\gamma^2}

so the number of mistakes k is bounded by (R/\gamma)^2. updating forever is impossible.

if the data isn't separable (no such \mathbf{u} exists), the mandate doesn't exist — there's no \gamma > 0 to force growth — and the perceptron can thrash forever. that's when we use the pocket algorithm (keep the best weights seen so far) or switch to a soft-margin classifier like an SVM.
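you can watch the bound hold on a toy dataset. the separating direction \mathbf{u} below is hand-picked for the demo (an assumption — the algorithm never sees it); \gamma and R come straight from their definitions:

```python
import math

# tiny separable dataset (no bias): the label is the sign of x1 + x2
data = [((1.0, 2.0), +1), ((2.0, 1.0), +1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]

u = (1 / math.sqrt(2), 1 / math.sqrt(2))                       # hand-picked unit separator
gamma = min(y * (u[0] * x[0] + u[1] * x[1]) for x, y in data)  # worst-case margin
R = max(math.hypot(x[0], x[1]) for x, _ in data)               # largest ||x||
bound = (R / gamma) ** 2

w, k = [0.0, 0.0], 0          # start at zero, count mistakes
changed = True
while changed:
    changed = False
    for x, y in data:
        if y * (w[0] * x[0] + w[1] * x[1]) <= 0:   # mistake (score 0 counts)
            w = [w[0] + y * x[0], w[1] + y * x[1]]
            k += 1
            changed = True

print(k, "<=", bound)   # mistakes never exceed (R/gamma)^2
assert k <= bound
```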

1.8

making it work

the convergence proof says the perceptron must stop. it doesn't say it stops gracefully.

remember our incident martian from 1.4? \mathbf{x} = [1,\, 170,\, 76]. with \eta = 1, the margin jump on each update is \eta\|\mathbf{x}\|^2 = 1^2 + 170^2 + 76^2 = 34{,}677. one mistake and the weights are in orbit.

why so violent? because height is in centimeters and weight is in kilograms. the height coordinate is roughly 100\times louder than everything else, so it dominates the squared norm. the convergence bound is (R/\gamma)^2 — if R is inflated by one unscaled feature, the bound blows up quadratically.

the fix is boring and universal: standardize features (x_i \leftarrow (x_i - \mu_i)/\sigma_i) so no single dimension dominates. after standardization, our martian's coordinates are order-1 instead of order-100, the updates stay sane, and the convergence bound tightens.

the recipe

  • standardize features: x_i \leftarrow (x_i - \mu_i)/\sigma_i (save \mu_i, \sigma_i).
  • add bias: include x_0 = 1 so the boundary can shift.
  • shuffle each epoch: same points, fewer dumb cycles.
  • early stop: stop when mistakes hit 0 (separable) or validation stops improving (non-separable).
  • average or pocket: when in doubt, ship \bar{\mathbf{w}} or the best \mathbf{w} seen.
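the first recipe bullet, sketched in plain python (the height/weight rows are made-up sample data):

```python
# standardize: x_i <- (x_i - mu_i) / sigma_i, saving mu and sigma so the
# same transform can be applied to new martians at prediction time.
def fit_standardizer(rows):
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    sigma = [(sum((r[j] - mu[j]) ** 2 for r in rows) / n) ** 0.5 for j in range(d)]
    return mu, sigma

def standardize(row, mu, sigma):
    return [(v - m) / s for v, m, s in zip(row, mu, sigma)]

rows = [[170.0, 76.0], [160.0, 52.0], [175.0, 78.0], [165.0, 60.0]]
mu, sigma = fit_standardizer(rows)
scaled = [standardize(r, mu, sigma) for r in rows]
# each column of `scaled` now has mean ~0 and standard deviation ~1,
# so neither height nor weight dominates ||x||^2 (or the updates)
```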

what if the data can't be separated by any line?

1.9

escaping flatland (xor + why neural nets exist)

four points at the corners of a square. label them by diagonal: one diagonal pair is +1, the other is -1.

A | B | output
--|---|-------
0 | 0 | -1
0 | 1 | +1
1 | 0 | +1
1 | 1 | -1

try to draw a line that separates them.

you can't. sweep the boundary through every angle — every orientation catches one +1 and one -1 on each side. the perceptron will oscillate forever, shoving the boundary back and forth, never settling.

a line can only cut the plane into two half-planes, and no half-plane contains exactly one diagonal of a square. the +1s and -1s alternate around the corners.

the score w_1 A + w_2 B sees each input's contribution independently — it knows "A is big" and "B is big" as separate facts.

what it can't see is whether A and B agree. agreement between A and B isn't a direction — it's a relationship between directions, invisible to any weighted sum.

there are two ways out:

  1. change the representation (feature engineering)
  2. learn the representation (a neural network)
1.9.1

the feature-map escape hatch (make xor separable)

if the machine can't draw the boundary using A and B, we have to invent a new input that captures what it's missing.

the machine is looking at A and B as separate facts. what it can't see is whether they agree. we need to turn "agreement" into a raw number we can hand to it.

look at the inputs. they are 0s and 1s. how do you mathematically combine two 0s and 1s into a single number that screams "both are on"?

  • you can't just add them (A+B): the sums are 0, 1, 1, 2 — both +1 cases land on 1, sandwiched between the -1 cases at 0 and 2, so no single threshold isolates them.
  • you can't subtract them either: A-B gives -1 and +1 for the +1 cases, with the -1 cases at 0 sitting in between. same trap.

but if you multiply them (A \cdot B), you get a 1 if and only if both A and B are 1. every other combination yields 0.

by manually computing A \cdot B and feeding it in alongside the original inputs, we give the machine a dedicated knob for that relationship. we literally pop the 2D plane into a 3D space, lifting the (1,1) corner up so a flat sheet of glass can slice cleanly underneath it.

this preprocessing step is called a feature map:

\phi(A,B) = (1, A, B, A \cdot B)

the mapped points are:

point | \phi(A,B) | label
------|-----------|------
(0,0) | (1, 0, 0, 0) | -1
(0,1) | (1, 0, 1, 0) | +1
(1,0) | (1, 1, 0, 0) | +1
(1,1) | (1, 1, 1, 1) | -1

one separating weight vector is \mathbf{w} = [-1, 2, 2, -4]. check:

  • (0,0): -1 + 0 + 0 + 0 = -1 → -1 ✓
  • (0,1): -1 + 0 + 2 + 0 = 1 → +1 ✓
  • (1,0): -1 + 2 + 0 + 0 = 1 → +1 ✓
  • (1,1): -1 + 2 + 2 - 4 = -1 → -1 ✓

translation: “if BOTH inputs are 1, subtract 4 from the score.” that feature makes XOR separable.

but this is the catch: you had to know to add A \cdot B. the perceptron can't see features it wasn't given — A \cdot B was in the beam's blind spot. it can only learn weights for the dimensions you hand it.

that A \cdot B term is the simplest interaction feature: it turns “both are on” into a single coordinate.

in the martian story, H \cdot W is the same move. it gives you a knob for “tall and heavy” that a purely linear score can’t express directly.
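here's the escape hatch end to end: lift the four XOR points through \phi, then let the plain perceptron loop find separating weights in the lifted space. it won't necessarily land on [-1, 2, 2, -4] — any separator counts:

```python
def phi(A, B):
    return [1.0, A, B, A * B]   # feature map: bias, raw inputs, interaction

xor = [((0, 0), -1), ((0, 1), +1), ((1, 0), +1), ((1, 1), -1)]
lifted = [(phi(A, B), y) for (A, B), y in xor]

w = [0.0, 0.0, 0.0, 0.0]
changed = True
while changed:                  # guaranteed to stop: the lifted data is separable
    changed = False
    for x, y in lifted:
        if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:   # mistake
            w = [wi + y * xi for wi, xi in zip(w, x)]
            changed = True

for x, y in lifted:
    assert y * sum(wi * xi for wi, xi in zip(w, x)) > 0     # strictly separated
print(w)
```

note the learned interaction weight comes out negative, just like the hand-built -4: "both on" has to subtract from the score.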

1.9.2

the question for next time

the feature map worked — but only because we knew A \cdot B was the right feature.

for XOR that's trivial — two inputs, one pairwise interaction to try. for our martians, we could guess H \cdot W. but for a real problem with a hundred measurements, the pairwise interactions alone number \binom{100}{2} = 4{,}950. three-way? 161{,}700. you're not hand-crafting your way out of that.

the perceptron can learn any boundary in the space you give it. but it can't see past that space. if the feature that matters is a combination you didn't think to include, no amount of weight-nudging will find it. the flashlight can only illuminate dimensions that exist.

what if the machine could build its own dimensions?

that's the idea behind a neural network. instead of you designing A \cdot B and hoping, you give the machine a stack of layers. each layer computes weighted sums — exactly like a perceptron — then bends the result through a nonlinearity (like clamping negatives to zero). the early layers combine raw inputs into new features. the later layers combine those into higher-order ones.

the network doesn't know it needs A \cdot B. it builds something that works like A \cdot B because the training signal rewards it. the feature map isn't designed. it's grown.

the final layer is still a dot product. everything we built in this essay — the score, the boundary, the sign check — is still there at the output. a neural network is a perceptron whose inputs are learned features rather than raw measurements.

but stacking layers creates a problem the perceptron never had.

in a single perceptron, when the machine miscalls a chonker, we know exactly who to blame: the one weight vector \mathbf{w}. we nudge it toward y\mathbf{x} and move on. one knob, one gradient, one step.

in a network with three layers and thousands of weights, the output depends on all of them. when the prediction is wrong, which layer screwed up? the first layer, which built a bad feature? the second, which combined good features badly? the last, which drew the boundary in the wrong place? every weight had a hand in the output, and each one deserves a different share of the blame.

credit assignment — which weight to blame, by how much. that's the next essay.

1.10

tl;dr (for the adhd crew)

  1. the machine: weighted sum of inputs → is it positive?
  2. the problem: we don't know the right weights
  3. the fix: when wrong, \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}
  4. why it helps: margin increases by \eta \|\mathbf{x}\|^2 on the mistake example
  5. the catch: convergence needs separability (otherwise it can thrash)
  6. the deeper catch: if your representation is wrong (xor), no amount of linear sweat will save you
  7. the open question: neural nets learn their own features — but when the output is wrong, which weights do you blame? (next essay: backpropagation)
1.11

exercises (pencil + paper)

the goal: you should be able to do these with a pencil, and you should know you're right at the end.

exercise 1: one update, no excuses
  • start with \mathbf{w} = [2, -1, 0.5] (bias, height, weight)
  • one martian: \mathbf{x} = [1, 4, 3], true label y = +1
  • learning rate: \eta = 1

do this by hand:

  1. compute \text{score} = \mathbf{w}^\top \mathbf{x} and the predicted label
  2. if it's wrong, apply the update \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}
  3. recompute the score with the new weights

score = 2 + (-1) \cdot 4 + 0.5 \cdot 3 = -0.5, so the prediction is -1 (wrong). update:

\mathbf{w}_{new} = [2, -1, 0.5] + (+1)[1, 4, 3] = [3, 3, 3.5]

new score = 3 + 3 \cdot 4 + 3.5 \cdot 3 = 25.5 (positive), so it flips to +1. you can also verify the margin jump:

m_{new} = m_{old} + \eta \|\mathbf{x}\|^2 = (-0.5) + (1^2 + 4^2 + 3^2) = 25.5
exercise 2: bias matters

consider two points with a single feature x:

  • x_A = 2, y_A = +1
  • x_B = 5, y_B = -1

now do two mini-cases:

  1. no bias term: you are forced to use \text{score} = w_1 x (the boundary must pass through 0)
  2. with bias: you can use \text{score} = w_0 + w_1 x

with no bias, the boundary is w_1 x = 0, so x = 0. both points have positive x, so they always land on the same side and can't be separated.

with bias, pick any threshold between 2 and 5, e.g. boundary x = 3.5. one valid choice:

w_0 = 3.5,\; w_1 = -1

then \text{score}_A = 3.5 - 2 = 1.5 > 0 (A is +1) and \text{score}_B = 3.5 - 5 = -1.5 < 0 (B is -1).

exercise 3: when it doesn't converge (pocket vs averaged)

this dataset is not separable on purpose: two examples share identical features but disagree on the label.

use a 2d input with bias: \mathbf{x} = [1, x_1, x_2].

training set (in this exact order), with \eta = 1 and starting weights \mathbf{w} = [0, 0, 0]:

  • A: \mathbf{x}_A = [1, 1, 1], y_A = +1
  • B: \mathbf{x}_B = [1, -1, 2], y_B = +1
  • C: \mathbf{x}_C = [1, 1, 1], y_C = -1 (label noise: same features as A, opposite label)

your job:

  1. run one epoch and write down \mathbf{w} after A, after B, and after C
  2. compute the pocket weight: among the weights you visited, which one makes the fewest mistakes on A/B/C?
  3. compute the averaged weight: \bar{\mathbf{w}} = the average of the three post-step weights, and count its mistakes on A/B/C

run the epoch:

  • start \mathbf{w} = [0, 0, 0]
  • A is misclassified (score = 0 → -1), so update: \mathbf{w} \leftarrow \mathbf{w} + (+1)[1, 1, 1] = [1, 1, 1]
  • B is correct under [1, 1, 1] (score = 1 + (-1) + 2 = 2), so no update (still [1, 1, 1])
  • C is misclassified under [1, 1, 1] (score = 3 → +1 but y = -1), so update: \mathbf{w} \leftarrow [1, 1, 1] + (-1)[1, 1, 1] = [0, 0, 0]

so the visited weights are:

  • after A: [1, 1, 1]
  • after B: [1, 1, 1]
  • after C: [0, 0, 0]

pocket: evaluate mistakes on A/B/C.

  • with [1, 1, 1]: A ✓, B ✓, C ✗ → 1 mistake
  • with [0, 0, 0]: A ✗, B ✗, C ✓ → 2 mistakes

so pocket keeps [1, 1, 1].

averaged: \bar{\mathbf{w}} = \frac{[1,1,1] + [1,1,1] + [0,0,0]}{3} = [2/3, 2/3, 2/3]. this predicts A ✓, B ✓, C ✗ → 1 mistake.

the moral: when the data is non-separable, the last iterate can be garbage. pocket/average gives you something stable to ship.

bonus: escape the number line

here's a 1D dataset:

x | label
--|------
0 | -1
1 | +1
2 | +1
3 | -1

the +1s are sandwiched between the -1s.

your job:

  1. explain why no single threshold w_0 + w_1 x = 0 can separate these points
  2. propose a feature map \phi(x) that makes them separable (hint: think quadratic)
  3. find a weight vector \mathbf{w} that works, and verify it on all four points

a single threshold splits the number line into two half-lines: everything left of the cut, and everything right. but the +1s live in the middle. you need a feature that "closes" around the +1s — something that's big in the middle and small at the edges (or vice versa).

1. a linear boundary w_0 + w_1 x = 0 is a single cut point on the number line. one side is +1, the other is -1. but the +1 region here is an interval [1, 2], not a half-line. you'd need two cuts — one threshold can't do it.

2. use \phi(x) = (1, x, x^2). the quadratic term lets the boundary curve.

3. try \mathbf{w} = [-1, 3, -1]. compute \mathbf{w}^\top \phi(x) = -1 + 3x - x^2:

  • x = 0: -1 + 0 - 0 = -1 → -1 ✓
  • x = 1: -1 + 3 - 1 = 1 → +1 ✓
  • x = 2: -1 + 6 - 4 = 1 → +1 ✓
  • x = 3: -1 + 9 - 9 = -1 → -1 ✓

the quadratic -x^2 + 3x - 1 is a downward parabola — positive at x = 1 and x = 2, negative at the endpoints. the perceptron can't invent x^2 — you had to add it — but once you do, the problem is linearly separable in the lifted space.
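the same check in code (the lift and weights are the ones from the solution):

```python
def phi(x):
    return [1.0, x, x * x]   # lift the number line: bias, x, x^2

w = [-1.0, 3.0, -1.0]        # w^T phi(x) = -1 + 3x - x^2

def predict(x):
    score = sum(wi * fi for wi, fi in zip(w, phi(x)))
    return +1 if score > 0 else -1

print([predict(x) for x in (0, 1, 2, 3)])   # [-1, 1, 1, -1]
```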

bonus 2: implement the loop

using the 5-bullet recipe from §1.8 as your spec, write the full training loop. here's a small dataset to test it on:

  • A = (1, 1, 1), y = +1
  • B = (1, 2, 1), y = +1
  • C = (1, 0.5, 2), y = -1

(the bias input is already included.) run 3 full passes through the data with \eta = 1, processing in order A, B, C each pass. report:

  1. final weights
  2. pocket weights (best seen)
  3. averaged weights

the recipe bullets are the spec. pocket = keep the weight vector with fewest mistakes on the full dataset. average = running sum of every weight vector seen, divided by total steps.

w = [0, 0, 0]
w_sum = [0, 0, 0]; t = 0
w_pocket = w; best_mistakes = inf

for epoch in 1..N:          # one pass through all points
  # (shuffle in production; fixed order here for hand-tracing)
  mistakes = 0
  for (x, y) in data:
    score = dot(w, x)
    pred = +1 if score > 0 else -1
    if pred != y:
      w = w + eta * y * x
      mistakes += 1
    w_sum += w; t += 1

  total_mistakes = count_mistakes(w, all_data)
  if total_mistakes < best_mistakes:
    w_pocket = w; best_mistakes = total_mistakes
  if mistakes == 0: break

w_avg = w_sum / t
return w, w_pocket, w_avg

hand-trace, pass 1 (\mathbf{w} starts at [0, 0, 0]):

step | point | \mathbf{w}^\top \mathbf{x} | pred | label | mistake? | \mathbf{w} after
-----|-------|-----------------------------|------|-------|----------|------------------
1 | A (1, 1, 1), y = +1 | 0 | -1 | +1 | yes | [1, 1, 1]
2 | B (1, 2, 1), y = +1 | 1 + 2 + 1 = 4 | +1 | +1 | no | [1, 1, 1]
3 | C (1, 0.5, 2), y = -1 | 1 + 0.5 + 2 = 3.5 | +1 | -1 | yes | [0, 0.5, -1]

after pass 1: \mathbf{w} = [0, 0.5, -1]. check all points:

  • A: 0 + 0.5 - 1 = -0.5 → pred -1, label +1 — wrong
  • B: 0 + 1 - 1 = 0 → pred -1, label +1 — wrong
  • C: 0 + 0.25 - 2 = -1.75 → pred -1, label -1 — correct

2 mistakes on the full dataset. pocket stores [0, 0.5, -1] with 2 mistakes (better than \infty, the starting count).

continue passes 2-3 the same way. your implementation should reproduce this trace for pass 1.

pass 2 (\mathbf{w} starts at [0, 0.5, -1]):

step | point | \mathbf{w}^\top \mathbf{x} | pred | label | mistake? | \mathbf{w} after
-----|-------|-----------------------------|------|-------|----------|------------------
4 | A | 0 + 0.5 - 1 = -0.5 | -1 | +1 | yes | [1, 1.5, 0]
5 | B | 1 + 3 + 0 = 4 | +1 | +1 | no | [1, 1.5, 0]
6 | C | 1 + 0.75 + 0 = 1.75 | +1 | -1 | yes | [0, 1, -2]

end of pass 2: \mathbf{w} = [0, 1, -2]. check all: A scores -1 (wrong), B scores 0 → -1 (wrong), C scores -3.5 (correct). 2 mistakes — no pocket improvement.

pass 3 (\mathbf{w} starts at [0, 1, -2]):

step | point | \mathbf{w}^\top \mathbf{x} | pred | label | mistake? | \mathbf{w} after
-----|-------|-----------------------------|------|-------|----------|------------------
7 | A | 0 + 1 - 2 = -1 | -1 | +1 | yes | [1, 2, -1]
8 | B | 1 + 4 - 1 = 4 | +1 | +1 | no | [1, 2, -1]
9 | C | 1 + 1 - 2 = 0 | -1 | -1 | no | [1, 2, -1]

end of pass 3: \mathbf{w} = [1, 2, -1]. check all: A scores 2 ✓, B scores 4 ✓, C scores 0 → -1 ✓. 0 mistakes. pocket updates to [1, 2, -1].

(note: point C lands exactly on the boundary with score 0. our code’s tie-breaker else -1 saves us here, but mathematically, you'd usually want a strictly non-zero score to claim true separation!)

final answers:

  1. final weights: [1, 2, -1]
  2. pocket weights: [1, 2, -1] (0 mistakes — converged, so pocket = final)
  3. averaged weights: \bar{\mathbf{w}} = \frac{1}{9}\sum_{t=1}^{9} \mathbf{w}_t = [7/9,\; 12.5/9,\; -4/9] \approx [0.78,\; 1.39,\; -0.44] — gets A and B right but misclassifies C (1 mistake)
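a runnable version of the loop, if you want to check your own against something executable (fractions keep the 0.5s exact; fixed A, B, C order, no shuffling, \eta = 1). it reproduces the hand-trace and all three answers:

```python
from fractions import Fraction as F

data = [([1, 1, 1], +1), ([1, 2, 1], +1), ([1, F(1, 2), 2], -1)]  # A, B, C

def count_mistakes(w):
    # pred = +1 if score > 0 else -1 (same tie-breaker as the trace)
    return sum((1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y
               for x, y in data)

w = [F(0), F(0), F(0)]
w_sum, t = [F(0), F(0), F(0)], 0
w_pocket, best = list(w), float("inf")

for epoch in range(3):                     # 3 fixed passes, order A, B, C
    mistakes = 0
    for x, y in data:
        if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) != y:
            w = [wi + y * xi for wi, xi in zip(w, x)]   # eta = 1
            mistakes += 1
        w_sum = [s + wi for s, wi in zip(w_sum, w)]     # running sum for averaging
        t += 1
    total = count_mistakes(w)
    if total < best:                       # pocket: best weights seen so far
        w_pocket, best = list(w), total
    if mistakes == 0:
        break

w_avg = [s / t for s in w_sum]
print([int(v) for v in w])          # [1, 2, -1]
print([int(v) for v in w_pocket])   # [1, 2, -1]
print([str(v) for v in w_avg])      # ['7/9', '25/18', '-4/9']  (25/18 = 12.5/9)
```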