Solving XOR with a Multi-Layer Perceptron


In the first perceptron post, a single neuron learned the logical OR function. That worked because OR is linearly separable: one straight line can split the 0 case from the 1 cases.

Now we will solve the case that breaks a single perceptron: XOR.

XOR returns 1 only when the two inputs are different:

x1x2XOR
000
011
101
110

There is no single straight line that separates those two 1 points from the two 0 points. To solve it, we need a network that can combine multiple simpler boundaries. That is what a multi-layer perceptron gives us.

XOR on an X/Y plane

Plot the two inputs as coordinates: x1 on the X-axis and x2 on the Y-axis.

XOR outputs plotted on a Cartesian plane

The diagonal corners have the same output: (0,0) and (1,1) are 0, while (0,1) and (1,0) are 1. That alternating pattern is why one perceptron fails. A perceptron can model only one staight line. Any single straight line that captures one 1 point will leave the other 1 point on the wrong side. To separate 1s and 0s in this case we need two separate lines.

The idea

For XOR, two hidden neurons are enough. The hidden layer learns useful intermediate features, and the output neuron combines those features into the final answer.

We will use sigmoid activations instead of the identity function:

σ(x)=11+ex\sigma(x) = \frac{1}{1 + e^{-x}}

XOR does not require sigmoid specifically, but it does require some non-linearity. If every layer used the identity function, the hidden layer would be:

hidden=XW1+b1output=(hidden)W2+b2\begin{aligned} \text{hidden} &= XW_1 + b_1 \\[0.6em] \text{output} &= (\text{hidden})W_2 + b_2 \end{aligned}

Substitute hidden into output:

output=(XW1+b1)W2+b2output=X(W1W2)+(b1W2+b2)\begin{aligned} \text{output} &= (XW_1 + b_1)W_2 + b_2 \\[0.6em] \text{output} &= X(W_1W_2) + (b_1W_2 + b_2) \end{aligned}

That collapses the whole network into one linear model:

output=XW+b\text{output} = XW + b

So an identity-only MLP has the same limitation as a single perceptron: it still cannot separate XOR. Sigmoid is just a convenient non-linear choice here because it is smooth, easy to differentiate, and works well with backpropagation. ReLU or tanh could also work.

The data

import numpy as np
import numpy.typing as npt

inputs: npt.NDArray[np.float64] = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1]],
    dtype=np.float64,
)

output: npt.NDArray[np.float64] = np.array(
    [[0], [1], [1], [0]],
    dtype=np.float64,
)

learning_rate = 0.5
epochs = 10_000

The inputs are still the same four binary pairs. The output is now the XOR truth table.

The network

This network has:

  1. Two input values: x1 and x2
  2. Two hidden neurons
  3. One output neuron
input_size = 2
hidden_size = 2
output_size = 1

w1 = np.array([[0.3, 0.8], [0.7, 0.2]])
b1 = np.zeros((1, hidden_size))

w2 = np.array([[0.6], [-0.4]])
b2 = np.zeros((1, output_size))

w1 connects the input layer to the hidden layer. w2 connects the hidden layer to the output layer. Each layer also has its own bias.

Notice that the weights start at different, non-zero values. That matters. If we set every weight to zero (or any single shared value), both hidden neurons would compute the same output, receive the same gradient during backpropagation, and update identically forever. They would stay perfect copies of each other and never learn the two distinct features XOR needs. This is the symmetry problem, and it is why weights are initialised to different values, usually small random numbers. The biases can safely start at zero because the differing weights are already enough to break the symmetry.

Forward pass

The forward pass is the network making a prediction:

def sigmoid(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return x * (1 - x)

def predict(
    x: npt.NDArray[np.float64],
) -> tuple[npt.NDArray[np.float64], npt.NDArray[np.float64]]:
    hidden = sigmoid(x @ w1 + b1)
    prediction = sigmoid(hidden @ w2 + b2)
    return hidden, prediction

The @ operator is matrix multiplication. So x @ w1 + b1 computes the hidden layer’s weighted sums, and hidden @ w2 + b2 computes the output neuron’s weighted sum.

Backpropagation

Backpropagation answers one question: which weights contributed to the error, and by how much?

For each training pass, we:

  1. Run a forward pass.
  2. Measure the output error.
  3. Push that error backward through the network.
  4. Update the weights and biases.
def train() -> None:
    global w1, b1, w2, b2

    for _ in range(epochs):
        hidden, predictions = predict(inputs)

        output_error = output - predictions
        output_delta = output_error * sigmoid_derivative(predictions)

        hidden_error = output_delta @ w2.T
        hidden_delta = hidden_error * sigmoid_derivative(hidden)

        w2 += learning_rate * hidden.T @ output_delta
        b2 += learning_rate * np.sum(output_delta, axis=0, keepdims=True)

        w1 += learning_rate * inputs.T @ hidden_delta
        b1 += learning_rate * np.sum(hidden_delta, axis=0, keepdims=True)

train()

The output layer is updated first because its error is closest to the prediction. Then that output error is sent backward into the hidden layer, letting the hidden weights learn what they should have detected.

That is the big jump from the first perceptron. The single perceptron could only update one direct set of weights. The MLP can assign credit and blame across layers.

Trying it out

Now we can ask the network for the XOR table:

_, predictions = predict(inputs)

print(np.round(predictions, 3))
print((predictions > 0.5).astype(int))

A typical output looks like this:

[[0.019]
 [0.983]
 [0.983]
 [0.017]]

[[0]
 [1]
 [1]
 [0]]

The raw sigmoid values are probabilities close to 0 or 1. Thresholding them at 0.5 gives the exact XOR answers.

Complete code

Here is the whole script in one place:

import numpy as np
import numpy.typing as npt

inputs: npt.NDArray[np.float64] = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1]],
    dtype=np.float64,
)

output: npt.NDArray[np.float64] = np.array(
    [[0], [1], [1], [0]],
    dtype=np.float64,
)

learning_rate = 0.5
epochs = 10_000

w1 = np.array([[0.3, 0.8], [0.7, 0.2]])
b1 = np.zeros((1, 2))
w2 = np.array([[0.6], [-0.4]])
b2 = np.zeros((1, 1))

def sigmoid(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return x * (1 - x)

def predict(
    x: npt.NDArray[np.float64],
) -> tuple[npt.NDArray[np.float64], npt.NDArray[np.float64]]:
    hidden = sigmoid(x @ w1 + b1)
    prediction = sigmoid(hidden @ w2 + b2)
    return hidden, prediction

def train() -> None:
    global w1, b1, w2, b2

    for _ in range(epochs):
        hidden, predictions = predict(inputs)

        output_error = output - predictions
        output_delta = output_error * sigmoid_derivative(predictions)

        hidden_error = output_delta @ w2.T
        hidden_delta = hidden_error * sigmoid_derivative(hidden)

        w2 += learning_rate * hidden.T @ output_delta
        b2 += learning_rate * np.sum(output_delta, axis=0, keepdims=True)

        w1 += learning_rate * inputs.T @ hidden_delta
        b1 += learning_rate * np.sum(hidden_delta, axis=0, keepdims=True)

train()

_, predictions = predict(inputs)

print(np.round(predictions, 3))
print((predictions > 0.5).astype(int))