← Home
11th January, 2026·Andrew R. Louisingh

the perceptron: no thoughts, just dot products

It’s just y = mx + b with a God complex.

1.1

the setup (mars edition)

we're on mars. we have two biological castes of martians:

🔵 chonkers (+1)

absolute units. dense.

built like refrigerators filled with other refrigerators.

🔴 skinny legends (-1)

gracile. hollow-boned.

built like a skittish sentient breadstick.

our job: build a machine that looks at a martian's measurements, then flashes the right light.

we have a clipboard. we start with two measurements: height H (in centimeters) and weight W (in kilograms).

we write our measurements as a vector:

\mathbf{x} = \begin{bmatrix} H \\ W \end{bmatrix}

on the scatter plots below, i’ll put weight on the horizontal axis and height on the vertical, because it reads cleaner. in the algebra, \mathbf{x} just means “the measurements” — the order is arbitrary as long as \mathbf{w} uses the same order.

1.2

drawing the line

before we do any math, solve the problem visually.

we want a decision boundary—a line that separates the chonkers from the skinny legends.

try it yourself. drag the handles on the axes to position the line so it separates the two groups.

now: what is that line, mathematically?

building the decision engine from scratch

how do we turn two numbers into one decision? we invent a score, then choose a cutoff: \text{score} > c means chonk.

look at the scatter. in this toy dataset, the chonkers mostly live to the right: heavier.

so a decent first bet is a score that rises when either measurement rises—move right, score goes up; move up, score goes up.

\text{score} = H + W

this score ignores units (cm vs kg). but it buys us two crucial things immediately:

  1. a single scalar we can threshold ("chonker if score > c")
  2. a clean geometry: the decision boundary \text{score} = c becomes a straight line (a diagonal cut between bottom-left and top-right)

now we fix the problems: cm and kg aren’t comparable, and sometimes one feature should count more than the other.

weights are exchange rates

if height should count double, you'd write:

\text{score} = 2H + W

that “2” is literally an exchange rate.

replace the hard-coded 1s with adjustable weights, w_1 and w_2:

\text{score} = w_1 H + w_2 W

interpretation: w_1 is the exchange rate from centimeters to score points, and w_2 from kilograms to score points. turn a weight up, and that measurement counts for more.

it’s primitive, but it lets you put heterogeneous measurements on the same scoreboard.

bias: pick the cutoff

we need one more number: a cutoff c. that’s the line in the sand.

what does that boundary look like? plot every point where \text{score} = c. you get a straight line—that's what you were dragging around.

the line isn't the thing.

it's just what "tied score" looks like in 2D. with three features it's a plane. with a hundred it's a hyperplane.

and here’s the small algebra trick that makes the rest of ML notation nicer: comparing to c is the same as moving c to the left and comparing to zero. define b = -c and write:

\text{score} + b > 0

same boundary, cleaner sign check.
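in code, the whole decision engine so far is two multiplications, an add, and a sign check. a minimal sketch — the weight and cutoff values here are made up for illustration:

```python
def classify(H, W, w1, w2, b):
    """score the martian, then check the sign of score + b."""
    score = w1 * H + w2 * W + b
    return 1 if score > 0 else -1   # +1 = chonker, -1 = skinny legend

# made-up numbers: one point per cm and per kg, cutoff c = 250 (so b = -c)
label = classify(170, 76, w1=1.0, w2=1.0, b=-250.0)   # → -1
```

three adjustable numbers (w1, w2, b) and a comparison — that's the entire machine.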

sanity check with real numbers

here's the equation you found:

test it on two martians:

three numbers → a machine that can tell them apart.

the general form

the weights w_1, w_2 control the angle of the boundary. the bias b (or equivalently the cutoff c = -b) shifts it.

it works for any number of features. add age, antennas? just add more weighted terms. it's always a flat cut through the space (a "hyperplane").

one more notation trick: treat the bias as a weight w_0 attached to a constant input x_0 = 1. then:

w_0 \cdot 1 + w_1 H + w_2 W = 0

which is a product of two vectors:

\begin{bmatrix} w_0 & w_1 & w_2 \end{bmatrix} \begin{bmatrix} 1 \\ H \\ W \end{bmatrix} = 0

so if we define:

\mathbf{x} = \begin{bmatrix} 1 \\ H \\ W \end{bmatrix}, \quad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix}

the whole machine is:

\mathbf{w}^\top \mathbf{x} = 0

one vector of weights, one vector of inputs. multiply and sum.
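in numpy, “multiply and sum” is the `@` operator. a sketch using the bias-as-weight trick (the weight values are again made up):

```python
import numpy as np

w = np.array([-250.0, 1.0, 1.0])   # [w0 (bias), w1, w2] — made-up values
x = np.array([1.0, 170.0, 76.0])   # [1, H, W]: the leading 1 carries the bias

score = w @ x                      # w^T x: multiply pairwise, then sum
label = 1 if score > 0 else -1
```

same machine as before, but now “add a feature” just means “make the vectors one entry longer.”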

1.3

the geometry of the dot product

we have this equation \mathbf{w}^\top \mathbf{x} = 0. what does it look like?

ignore the bias for a second. the weights (w_1, w_2) aren't just exchange rates — together they form a vector, a direction in height-weight space. that direction is where the machine is looking: the combination of height and weight it cares about most.

the score measures how well a point \mathbf{x} aligns with that direction:

\text{score} = \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \|\mathbf{x}\| \cos \theta

where \theta is the angle between them. point roughly along \mathbf{w} and \cos\theta is positive: positive score. point away from it: negative score. point exactly perpendicular: \cos\theta = 0, score zero.

and that last case is the boundary. \text{score} = 0 means "perpendicular to \mathbf{w}." the boundary isn't something we draw separately — it falls out of the geometry. every point perpendicular to the weight vector scores zero, and that set of points forms a line (in 2d), a plane (in 3d), a hyperplane (in general).

learning is just rotating the flashlight so that the beam illuminates the chonkers and leaves the skinny legends in the dark.

the weights are the classifier. change the weights, you rotate the boundary.

we won't "solve for the line." we'll start with random weights and watch the first mistake happen.

1.4

the incident (an actual example with actual numbers)

our machine boots up. the knobs are set randomly:

\mathbf{w} = \begin{bmatrix} 5 \\ 0.1 \\ -0.3 \end{bmatrix}

meaning: a bias of 5, plus 0.1 score points per cm of height, minus 0.3 points per kg of weight.

a martian waddles in:

\mathbf{x} = \begin{bmatrix} 1 \\ 170 \\ 76 \end{bmatrix}

the machine computes:

\text{score} = 5 \cdot 1 + 0.1 \cdot 170 + (-0.3) \cdot 76

step by step:

\text{score} = 5 + 17 - 22.8 = \mathbf{-0.8}

score is negative. machine predicts skinny legend (-1).

but this martian is a CHONKER.

false negative.

geometrically: \mathbf{w}^\top\mathbf{x} < 0, so this point is on the “negative” side of the boundary — behind the beam.
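the incident, replayed in code with the same numbers:

```python
import numpy as np

w = np.array([5.0, 0.1, -0.3])        # the randomly-set knobs
x = np.array([1.0, 170.0, 76.0])      # bias input, height, weight

score = w @ x                         # 5 + 17 - 22.8 = -0.8
prediction = 1 if score > 0 else -1   # -1: "skinny legend" — a false negative
```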

1.5

learning from mistakes (the update)

our machine called a chonker a skinny legend. the beam is pointing the wrong way. we need to change the weights.

take one training example (\mathbf{x}, y) where y \in \{-1, +1\}.

being wrong is not mystical. when the prediction and the label agree, y \cdot \text{score} is positive (two positives or two negatives). when they disagree, it's negative. so "wrong" collapses to a single inequality:

y(\mathbf{w}^\top \mathbf{x}) < 0

now ask the only useful question:

what change to \mathbf{w} makes y(\mathbf{w}^\top\mathbf{x}) go up the fastest?

let's write the effect of an update from first principles. suppose we add some \Delta\mathbf{w}:

y((\mathbf{w} + \Delta\mathbf{w})^\top \mathbf{x}) = y(\mathbf{w}^\top \mathbf{x}) + y(\Delta\mathbf{w}^\top \mathbf{x})

the first term is the old mistake. the second term is the only knob we can turn.

another way to see the same physics: the score is linear in the weights.

if you change one weight by \Delta w_i, the score changes by \Delta w_i \cdot x_i.

so the example itself tells you where the leverage is. for our incident \mathbf{x} = [1, 170, 76]: a nudge to the height weight moves the score 170× as far as the same nudge to the bias weight, and a nudge to the weight-weight moves it 76× as far.

so we want \Delta\mathbf{w} to have a big positive dot product with y\mathbf{x}.

and we already know what dot products measure: alignment.

if you hold the step size fixed (say, \|\Delta\mathbf{w}\| = 1), the dot product \Delta\mathbf{w}^\top (y\mathbf{x}) = \|\Delta\mathbf{w}\|\,\|\mathbf{x}\|\cos\theta is maximized at \theta = 0 — when \Delta\mathbf{w} points in the same direction as y\mathbf{x}.

so the simplest update is: point \Delta\mathbf{w} directly at y\mathbf{x}, with some step size \eta (the learning rate).

\mathbf{w} \leftarrow \mathbf{w} + \eta \cdot y \cdot \mathbf{x}
the perceptron update (error-driven)

when y=+1 you add \mathbf{x}; when y=-1 you subtract \mathbf{x}. same rule.

that’s the update: when you’re wrong, move toward the labeled point.

in pseudocode:

initialize w
repeat for epochs:          # one epoch = one pass through all points
  for (x, y) in data:
    pred = +1 if dot(w, x) > 0 else -1
    if pred != y:            # mistake
      w = w + η * y * x
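the same loop as runnable python — a sketch where, matching the worked traces later in this post, a score of exactly 0 predicts -1:

```python
import numpy as np

def train(data, eta=1.0, epochs=100):
    """data: list of (x, y) with x bias-augmented, y in {-1, +1}."""
    w = np.zeros(len(data[0][0]))
    for _ in range(epochs):               # one epoch = one pass through all points
        mistakes = 0
        for x, y in data:
            pred = 1 if w @ x > 0 else -1
            if pred != y:                 # mistake: move toward the labeled point
                w = w + eta * y * x
                mistakes += 1
        if mistakes == 0:                 # clean pass: everything classified, stop
            break
    return w

# a tiny separable set (same one as the Bonus-2 exercise)
data = [(np.array([1.0, 1.0, 1.0]), 1),
        (np.array([1.0, 2.0, 1.0]), 1),
        (np.array([1.0, 0.5, 2.0]), -1)]
w = train(data)                           # converges to [1, 2, -1] in this order
```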
1.6

why this helps (the one-step guarantee)

without this trick, we'd need two cases: "if y=+1, check that the score is positive; if y=-1, check that it's negative." multiplying by y collapses both into one number — the signed margin — that should always be positive:

m = y \cdot (\mathbf{w}^\top \mathbf{x})

after an update \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}:

\begin{aligned} m_{new} &= y((\mathbf{w} + \eta y \mathbf{x})^\top \mathbf{x}) \\ &= y(\mathbf{w}^\top \mathbf{x}) + \eta \cdot y^2 \cdot \|\mathbf{x}\|^2 \\ &= m_{old} + \eta \|\mathbf{x}\|^2 \end{aligned}
signed-margin improvement on that example

since \eta > 0 and \|\mathbf{x}\|^2 > 0 for any real input, the signed margin strictly increases on that example.
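you can watch the guarantee hold numerically on the incident example from 1.4:

```python
import numpy as np

w = np.array([5.0, 0.1, -0.3])
x = np.array([1.0, 170.0, 76.0])
y, eta = 1, 1.0

m_old = y * (w @ x)            # about -0.8: we're wrong on this example
w_new = w + eta * y * x        # the perceptron update
m_new = y * (w_new @ x)

# the algebra above says: m_new = m_old + eta * ||x||^2
assert np.isclose(m_new, m_old + eta * (x @ x))
assert m_new > 0               # one update flips this example to the right side
```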

so far we have a local guarantee: one update makes you less wrong on that point. it doesn't, by itself, promise the whole dataset will stop producing mistakes. now the global question:

can the dataset keep forcing updates forever?

1.7

the inevitable crash (why it stops)

here's the argument that says no. if the data is separable, the perceptron cannot loop forever. our machine sorting martians must eventually stop misclassifying them.

not "probably won't." cannot.

we can prove this by tracking two physical quantities that are on a collision course.

the mandate (must grow fast)

suppose the data is separable: there's a perfect unit separator \mathbf{u} (with \|\mathbf{u}\| = 1) that classifies every point with margin \gamma > 0, meaning y(\mathbf{u}^\top \mathbf{x}) \ge \gamma for every example. let k count the mistakes made so far. each update adds \eta y\mathbf{x}, which always has a positive component along \mathbf{u} (at least \eta\gamma, by definition of the margin). after k mistakes:

\mathbf{w}^\top \mathbf{u} \ge k \cdot (\eta \gamma)

the projection grows linearly with mistakes k.
the projection grows linearly with mistakes kk.

the budget (can only grow slow)

every mistake adds at most a bounded amount of "energy" (squared length) to \mathbf{w}. expand \|\mathbf{w} + \eta y \mathbf{x}\|^2 = \|\mathbf{w}\|^2 + 2\eta\, y(\mathbf{w}^\top \mathbf{x}) + \eta^2 \|\mathbf{x}\|^2. we only update on mistakes, so the cross term is negative, and the last term is at most (\eta R)^2, where R = \max \|\mathbf{x}\| over the dataset. after k mistakes:

\|\mathbf{w}\|^2 \le k \cdot (\eta R)^2

the length grows at most like \sqrt{k}.

do you see the problem? (drag the slider).

but \mathbf{w}^\top \mathbf{u} \le \|\mathbf{w}\|\,\|\mathbf{u}\| = \|\mathbf{w}\| (cauchy-schwarz, using \|\mathbf{u}\| = 1). a vector's projection onto a unit vector can never exceed the vector's own length.

\mathbf{w}^\top \mathbf{u} \le \|\mathbf{w}\|

so:

k \cdot (\eta \gamma) \le \sqrt{k} \cdot (\eta R)

a linear function of k will always eventually overtake a square root of k. but a projection can never outgrow the vector itself — that's geometry, not an approximation. so the only escape is that k stops growing: the algorithm must run out of mistakes.

let's pin down the number. the "collision" happens when the lower bound hits the upper bound:

k \eta \gamma \le \sqrt{k}\, \eta R

divide by \eta \sqrt{k}:

\sqrt{k}\, \gamma \le R

square both sides:

k \gamma^2 \le R^2 \implies k \le \frac{R^2}{\gamma^2}

so the number of mistakes k is bounded by (R/\gamma)^2. mistaking martians for eternity isn't just unlikely — it's arithmetically impossible.
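a quick numerical check of the bound on a tiny made-up separable dataset, with a separator \mathbf{u} chosen by hand (here a mistake is y(\mathbf{w}^\top\mathbf{x}) \le 0, the convention the proof uses):

```python
import numpy as np

data = [(np.array([ 2.0,  1.0]),  1),
        (np.array([ 1.0,  2.0]),  1),
        (np.array([-1.0, -2.0]), -1),
        (np.array([-2.0, -1.0]), -1)]

u = np.array([1.0, 1.0]) / np.sqrt(2.0)        # a known unit separator
gamma = min(y * (u @ x) for x, y in data)      # worst-case margin: 3/sqrt(2)
R = max(np.linalg.norm(x) for x, y in data)    # largest input norm: sqrt(5)

w, k = np.zeros(2), 0                          # k counts mistakes
for _ in range(100):                           # cycle until a clean pass
    clean = True
    for x, y in data:
        if y * (w @ x) <= 0:                   # mistake (score 0 counts as wrong)
            w, k, clean = w + y * x, k + 1, False
    if clean:
        break

# the promise: k can never exceed (R / gamma)^2, here about 1.11
```

for this dataset the bound is about 1.11, so the perceptron is allowed at most one mistake — and that's exactly what happens.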

if the data isn't separable (no such \mathbf{u} exists), the mandate doesn't exist — there's no \gamma > 0 to force growth — and the perceptron can thrash forever. that's when we use the pocket algorithm (keep the best weights seen so far) or switch to a soft-margin classifier like an SVM.

1.8

making it work

the convergence proof says the perceptron must stop. it doesn't say it stops gracefully.

remember our incident martian from 1.4? \mathbf{x} = [1, 170, 76]. with \eta = 1, the margin jump on each update is \eta\|\mathbf{x}\|^2 = 1^2 + 170^2 + 76^2 = 34{,}677. one mistake and the weights are in orbit.

why so violent? because height is in centimeters and weight is in kilograms. the height coordinate sits near 170 — more than a hundred times louder than the constant bias input — so it dominates the squared norm. the convergence bound is (R/\gamma)^2: if R is inflated by one unscaled feature, the bound blows up quadratically.

the fix is boring and universal: standardize features (x_i \leftarrow (x_i - \mu_i)/\sigma_i) so no single dimension dominates. after standardization, our martian's coordinates are order-1 instead of order-100, the updates stay sane, and the convergence bound tightens.
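a standardization sketch (made-up martians; the bias column is attached after scaling so the constant 1 is never rescaled):

```python
import numpy as np

X = np.array([[170.0, 76.0],    # height (cm), weight (kg) — made-up martians
              [155.0, 40.0],
              [182.0, 95.0],
              [160.0, 45.0]])

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_std = (X - mu) / sigma        # each column now has mean 0 and std 1

# prepend the bias input afterwards, so the constant 1 stays a constant 1
X_aug = np.hstack([np.ones((len(X_std), 1)), X_std])
```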

the recipe

putting the engineering together:

  • standardize every feature (subtract the mean, divide by the standard deviation)
  • prepend the constant input x_0 = 1 so the bias is just another weight
  • initialize \mathbf{w} (zeros are fine)
  • loop over the data; on each mistake, update \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}
  • stop after a clean pass — or keep pocket weights if the data might not be separable

this handles the engineering. but there's a deeper problem no amount of tuning can fix: what if the data can't be separated by any line?

1.9

escaping flatland (xor + why neural nets exist)

four points at the corners of a square. label them by diagonal: one diagonal pair is +1, the other is -1.

A   B   output
0   0   -1
0   1   +1
1   0   +1
1   1   -1

try to draw a line that separates them.

you can't. sweep the boundary through every angle — every orientation catches one +1 and one -1 on each side. the perceptron will oscillate forever, shoving the boundary back and forth, never settling.

this isn't bad luck. it's geometry. a line can only cut the plane into two half-planes, and no half-plane contains exactly one diagonal of a square. the +1s and -1s alternate around the corners.

the flashlight tells you why. the score w_1 A + w_2 B sees each input's contribution independently — it knows "A is big" and "B is big" as separate facts.

what it can't see is whether A and B agree. that's not a direction in the input space. it's a relationship between directions, and no weighted sum can compute it.

there are two ways out:

  1. change the representation (feature engineering)
  2. learn the representation (a neural network)
1.9.1

the feature-map escape hatch (make xor separable)

if the flashlight can't see the pattern, give it a new dimension to look along. not a random one — you need a feature that encodes the relationship the beam is missing.

for XOR, the missing fact is "both are on." that's literally A \cdot B.

define a feature map:

\phi(A,B) = (1, A, B, A\cdot B)

the mapped points are:

point    φ(A,B)          label
(0,0)    (1, 0, 0, 0)    -1
(0,1)    (1, 0, 1, 0)    +1
(1,0)    (1, 1, 0, 0)    +1
(1,1)    (1, 1, 1, 1)    -1

one separating weight vector is \mathbf{w} = [-1, 2, 2, -4]. check: the scores on the four mapped points are -1, then -1 + 2 = 1, then -1 + 2 = 1, then -1 + 2 + 2 - 4 = -1 — the signs match the labels.

translation: “if BOTH inputs are 1, subtract 4 from the score.” that feature makes XOR separable.
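checking that weight vector in code:

```python
import numpy as np

def phi(A, B):
    return np.array([1.0, A, B, A * B])    # bias, the two inputs, the interaction

w = np.array([-1.0, 2.0, 2.0, -4.0])

xor = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
preds = [1 if w @ phi(A, B) > 0 else -1 for (A, B), _ in xor]   # [-1, 1, 1, -1]
```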

but this is the catch: you had to know to add A\cdot B. the perceptron can't see features it wasn't given — A \cdot B was in the beam's blind spot. it can only learn weights for the dimensions you hand it.

that A\cdot B term is the simplest interaction feature: it turns “both are on” into a single coordinate.

in the martian story, H\cdot W is the same move. it gives you a knob for “tall and heavy” that a purely linear score can’t express directly.

1.9.2

the question for next time

the feature map worked. but notice what it cost us: we had to know that A\cdot B was the right feature.

for XOR that's trivial — two inputs, one pairwise interaction to try. for our martians, we could guess H\cdot W. but for a real problem with a hundred measurements, the pairwise interactions alone number \binom{100}{2} = 4{,}950. three-way? 161{,}700. you're not hand-crafting your way out of that.

the perceptron can learn any boundary in the space you give it. but it can't see past that space. if the feature that matters is a combination you didn't think to include, no amount of weight-nudging will find it. the flashlight can only illuminate dimensions that exist.

what if the machine could build its own dimensions?

that's the idea behind a neural network. instead of you designing A\cdot B and hoping, you give the machine a stack of layers. each layer computes weighted sums — exactly like a perceptron — then bends the result through a nonlinearity (like clamping negatives to zero). the early layers combine raw inputs into new features. the later layers combine those into higher-order ones.

the network doesn't know it needs A\cdot B. it builds something that works like A\cdot B because the training signal rewards it. the feature map isn't designed. it's grown.

and the beautiful part: the final layer is still a dot product. everything we built in this essay — the score, the boundary, the sign check — is still there at the output. a neural network is a perceptron whose inputs are learned features rather than raw measurements.
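to make "learned features" concrete, here's a hand-wired two-layer relu network that computes XOR. the hidden weights are chosen by hand for the demo — a trained network would grow something equivalent — and the final layer is still just a dot product:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def xor_net(A, B):
    x = np.array([A, B], dtype=float)
    # layer 1: two perceptron-style weighted sums, bent through a nonlinearity
    h = relu(np.array([[1.0, 1.0],           # h1 = relu(A + B)     ("at least one on")
                       [1.0, 1.0]]) @ x
             + np.array([0.0, -1.0]))        # h2 = relu(A + B - 1) ("both on")
    # layer 2: an ordinary perceptron over the hidden features
    score = np.array([1.0, -2.0]) @ h - 0.5  # h1 - 2*h2 - 0.5
    return 1 if score > 0 else -1
```

h2 plays exactly the role of the hand-designed A·B feature: it fires only when both inputs are on, and the output layer subtracts it off.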

but stacking layers creates a problem the perceptron never had.

in a single perceptron, when the machine miscalls a chonker, we know exactly who to blame: the one weight vector \mathbf{w}. we nudge it toward y\mathbf{x} and move on. one knob, one gradient, one step.

in a network with three layers and thousands of weights, the output depends on all of them. when the prediction is wrong, which layer screwed up? the first layer, which built a bad feature? the second, which combined good features badly? the last, which drew the boundary in the wrong place? every weight had a hand in the output, and each one deserves a different share of the blame.

that's the credit assignment problem, and solving it is the subject of the next essay.

1.10

tl;dr (for the adhd crew)

  1. the machine: weighted sum of inputs → is it positive?
  2. the problem: we don't know the right weights
  3. the fix: when wrong, \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}
  4. why it helps: margin increases by \eta \|\mathbf{x}\|^2 on the mistake example
  5. the catch: convergence needs separability (otherwise it can thrash)
  6. the deeper catch: if your representation is wrong (xor), no amount of linear sweat will save you
  7. the open question: neural nets learn their own features — but when the output is wrong, which weights do you blame? (next essay: backpropagation)
1.11

exercises (pencil + paper)

the goal: you should be able to do these with a pencil, and you should know you're right at the end.

Exercise 1: One Update, No Excuses
Question
  • start with \mathbf{w} = [2, -1, 0.5] (bias, height, weight)
  • one martian: \mathbf{x} = [1, 4, 3], true label y = +1
  • learning rate: \eta = 1

do this by hand:

  1. compute \text{score} = \mathbf{w}^\top \mathbf{x} and the predicted label
  2. if it's wrong, apply the update \mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}
  3. recompute the score with the new weights

score = 2 + (-1)\cdot 4 + 0.5 \cdot 3 = -0.5, so prediction is -1 (wrong). update:

\mathbf{w}_{new} = [2, -1, 0.5] + (+1)[1, 4, 3] = [3, 3, 3.5]

new score = 3 + 3\cdot 4 + 3.5\cdot 3 = 25.5 (positive), so it flips to +1. you can also verify the margin jump:

m_{new} = m_{old} + \eta \|\mathbf{x}\|^2 = (-0.5) + (1^2 + 4^2 + 3^2) = 25.5
Exercise 2: Bias Matters
Question

consider two points with labels:

  • \mathbf{x}_A = [1, 2], y_A = +1
  • \mathbf{x}_B = [1, 5], y_B = -1

now do two mini-cases:

  1. no bias term: you are forced to use \text{score} = w_1 x (boundary must pass through 0)
  2. with bias: you can use \text{score} = w_0 + w_1 x

with no bias, the boundary is w_1 x = 0, so x = 0. both points have positive x, so they always land on the same side and can't be separated.

with bias, pick any threshold between 2 and 5, e.g. boundary x = 3.5. one valid choice:

w_0 = 3.5,\; w_1 = -1

then \text{score}_A = 3.5 - 2 = 1.5 > 0 (A is +1) and \text{score}_B = 3.5 - 5 = -1.5 < 0 (B is -1).

Exercise 3: When It Doesn't Converge (Pocket vs Averaged)
Question

this dataset is not separable on purpose: two examples share identical features but disagree on the label.

use a 2d input with bias: \mathbf{x} = [1, x_1, x_2].

training set (in this exact order), with \eta = 1 and starting weights \mathbf{w} = [0, 0, 0]:

  • A: \mathbf{x}_A = [1, 1, 1], y_A = +1
  • B: \mathbf{x}_B = [1, -1, 2], y_B = +1
  • C: \mathbf{x}_C = [1, 1, 1], y_C = -1 (label noise: same features as A, opposite label)

your job:

  1. run one epoch and write down \mathbf{w} after A, after B, and after C
  2. compute the pocket weight: among the weights you visited, which one makes the fewest mistakes on A/B/C?
  3. compute the averaged weight: \bar{\mathbf{w}} = average of the three post-step weights, and count its mistakes on A/B/C

run the epoch:

  • start \mathbf{w} = [0, 0, 0]
  • A is misclassified (score = 0 → -1), so update: \mathbf{w} \leftarrow \mathbf{w} + (+1)[1, 1, 1] = [1, 1, 1]
  • B is correct under [1, 1, 1] (score = 1 + (-1) + 2 = 2), so no update (still [1, 1, 1])
  • C is misclassified under [1, 1, 1] (score = 3 → +1 but y = -1), so update: \mathbf{w} \leftarrow [1, 1, 1] + (-1)[1, 1, 1] = [0, 0, 0]

so the visited weights are:

  • after A: [1, 1, 1]
  • after B: [1, 1, 1]
  • after C: [0, 0, 0]

pocket: evaluate mistakes on A/B/C.

  • with [1, 1, 1]: A ✓, B ✓, C ✗ → 1 mistake
  • with [0, 0, 0]: A ✗, B ✗, C ✓ → 2 mistakes

so pocket keeps [1, 1, 1].

averaged: \bar{\mathbf{w}} = \frac{[1,1,1] + [1,1,1] + [0,0,0]}{3} = [2/3, 2/3, 2/3]. this predicts A ✓, B ✓, C ✗ → 1 mistake.

the moral: when the data is non-separable, the last iterate can be garbage. pocket/average gives you something stable to ship.

Bonus: Escape the Number Line
Question

here's a 1D dataset:

x   label
0   -1
1   +1
2   +1
3   -1

the +1s are sandwiched between the -1s.

your job:

  1. explain why no single threshold w_0 + w_1 x = 0 can separate these points
  2. propose a feature map \phi(x) that makes them separable (hint: think quadratic)
  3. find a weight vector \mathbf{w} that works, and verify it on all four points

a single threshold splits the number line into two half-lines: everything left of the cut, and everything right. but the +1s live in the middle. you need a feature that "closes" around the +1s — something that's big in the middle and small at the edges (or vice versa).

1. a linear boundary w_0 + w_1 x = 0 is a single cut point on the number line. one side is +1, the other is -1. but the +1 region here is an interval [1, 2], not a half-line. you'd need two cuts — one threshold can't do it.

2. use \phi(x) = (1, x, x^2). the quadratic term lets the boundary curve.

3. try \mathbf{w} = [-1, 3, -1]. compute \mathbf{w}^\top \phi(x) = -1 + 3x - x^2:

  • x = 0: -1 + 0 - 0 = -1 → -1 ✓
  • x = 1: -1 + 3 - 1 = 1 → +1 ✓
  • x = 2: -1 + 6 - 4 = 1 → +1 ✓
  • x = 3: -1 + 9 - 9 = -1 → -1 ✓

the quadratic -x^2 + 3x - 1 is a downward parabola — positive at x = 1 and x = 2, negative at the endpoints. the perceptron can't invent x^2 — you had to add it — but once you do, the problem is linearly separable in the lifted space.
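the same verification in code:

```python
import numpy as np

def phi(x):
    return np.array([1.0, x, x * x])   # lift the number line into 3 dimensions

w = np.array([-1.0, 3.0, -1.0])        # the weights from step 3

points = [(0, -1), (1, 1), (2, 1), (3, -1)]
preds = [1 if w @ phi(x) > 0 else -1 for x, _ in points]   # [-1, 1, 1, -1]
```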

Bonus 2: Implement the Loop
Question

using the 5-bullet recipe from §1.8 as your spec, write the full training loop. here's a small dataset to test it on:

  • A = (1, 1, 1), y = +1
  • B = (1, 2, 1), y = +1
  • C = (1, 0.5, 2), y = -1

(already includes bias.) run 3 full passes through the data with \eta = 1, processing in order A, B, C each pass. report:

  1. final weights
  2. pocket weights (best seen)
  3. averaged weights

the recipe bullets are the spec. pocket = keep the weight vector with fewest mistakes on the full dataset. average = running sum of every weight vector seen, divided by total steps.

import numpy as np

def count_mistakes(w, data):
    return sum((1 if w @ x > 0 else -1) != y for x, y in data)

def train(data, eta=1.0, n_epochs=3):
    w = np.zeros(len(data[0][0]))
    w_sum = np.zeros_like(w); t = 0
    w_pocket, best_mistakes = w.copy(), float("inf")

    for epoch in range(n_epochs):      # one pass through all points
        # (shuffle in production; fixed order here for hand-tracing)
        mistakes = 0
        for x, y in data:
            score = w @ x
            pred = 1 if score > 0 else -1
            if pred != y:
                w = w + eta * y * x
                mistakes += 1
            w_sum += w; t += 1

        total_mistakes = count_mistakes(w, data)
        if total_mistakes < best_mistakes:
            w_pocket, best_mistakes = w.copy(), total_mistakes
        if mistakes == 0:
            break

    return w, w_pocket, w_sum / t      # final, pocket, averaged

hand-trace, pass 1 (\mathbf{w} starts at [0, 0, 0]):

step   point               w·x             pred   label   mistake?   w after
1      A (1,1,1), y=+1     0               -1     +1      yes        [1, 1, 1]
2      B (1,2,1), y=+1     1+2+1 = 4       +1     +1      no         [1, 1, 1]
3      C (1,0.5,2), y=-1   1+0.5+2 = 3.5   +1     -1      yes        [0, 0.5, -1]

after pass 1: \mathbf{w} = [0, 0.5, -1]. check all points:

  • A: 0 + 0.5 - 1 = -0.5 → pred -1, label +1 — wrong
  • B: 0 + 1 - 1 = 0 → pred -1, label +1 — wrong
  • C: 0 + 0.25 - 2 = -1.75 → pred -1, label -1 — correct

2 mistakes on the full dataset. pocket stores [0, 0.5, -1] with 2 mistakes (better than \infty, the starting count).

continue passes 2-3 the same way. your implementation should reproduce this trace for pass 1.

pass 2 (\mathbf{w} starts at [0, 0.5, -1]):

step   point   w·x                   pred   label   mistake?   w after
4      A       0 + 0.5 - 1 = -0.5    -1     +1      yes        [1, 1.5, 0]
5      B       1 + 3 + 0 = 4         +1     +1      no         [1, 1.5, 0]
6      C       1 + 0.75 + 0 = 1.75   +1     -1      yes        [0, 1, -2]

end of pass 2: \mathbf{w} = [0, 1, -2]. check all: A → -1 (wrong), B → 0 → pred -1 (wrong), C → -3.5 (correct). 2 mistakes — no pocket improvement.

pass 3 (\mathbf{w} starts at [0, 1, -2]):

step   point   w·x              pred   label   mistake?   w after
7      A       0 + 1 - 2 = -1   -1     +1      yes        [1, 2, -1]
8      B       1 + 4 - 1 = 4    +1     +1      no         [1, 2, -1]
9      C       1 + 1 - 2 = 0    -1     -1      no         [1, 2, -1]

end of pass 3: \mathbf{w} = [1, 2, -1]. check all: A = 2 ✓, B = 4 ✓, C = 0 → pred -1 ✓. 0 mistakes. pocket updates to [1, 2, -1].

final answers:

  1. final weights: [1, 2, -1]
  2. pocket weights: [1, 2, -1] (0 mistakes — converged, so pocket = final)
  3. averaged weights: \bar{\mathbf{w}} = \frac{1}{9}\sum_{t=1}^{9} \mathbf{w}_t = [7/9,\; 12.5/9,\; -4/9] \approx [0.78,\; 1.39,\; -0.44] — gets A and B right but misclassifies C (1 mistake)