Build a Three-Layer Neural Network From Scratch
This article describes how to design a three-layer neural network from scratch. The network is trained on MNIST and, after training, can recognize the digit in any input image.
Story: Why did I write this?
I started reading Deep Learning From Scratch by 斎藤 康毅 two years ago, but I got stuck at the back-propagation part at that time.
Last summer I finally finished the book within a week, and came away with a stronger interest in Deep Learning (DL). Although the book provides a good set of example code for implementing the network in Python, I was too lazy to practice with it at that moment.
Recently, a personal project gave me the chance to build such a network myself; the only constraint is that I can only use MATLAB to do so.
For certain reasons, the code will not be shown in this article, but I will explain the math behind it. If you are interested in the code, please feel free to contact me.
Mathematical Description
Let’s start by describing the network mathematically.
Structure of the Three-Layer Neural Network
 Input Layer: 784 neurons (MNIST images are 28x28 pixels, so flattening them yields 784 input features).
 Hidden Layer: 124 neurons.
 Output Layer: 10 neurons (corresponding to the digits “0” through “9”).
Activation Function
The sigmoid function $\sigma(x)$ is defined as: $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
Sigmoid Derivative
The derivative of the sigmoid function $\sigma(x)$ is: $$ \sigma'(x) = \sigma(x) \cdot (1 - \sigma(x)) $$
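The article’s own code is in MATLAB and not shown; as a minimal illustration, the two formulas above can be sketched in NumPy like this:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation: sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative: sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

Note that sigmoid(0) = 0.5 and the derivative attains its maximum value of 0.25 at x = 0, which is why sigmoid gradients shrink quickly away from the origin.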
For a neural network with $L$ layers, where each layer $l$ has weights $W^{(l)}$ and biases $b^{(l)}$, and using the sigmoid activation function, the steps are as follows:
Forward Propagation
For each layer $l$:

Compute the weighted sum: $$ z^{(l)} = {W^{(l)}}^{T} a^{(l-1)} + b^{(l)} $$

Apply the sigmoid activation function: $$ a^{(l)} = \sigma(z^{(l)}) $$
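A minimal NumPy sketch of this forward pass, using the same $z^{(l)} = {W^{(l)}}^{T} a^{(l-1)} + b^{(l)}$ convention (the 784-124-10 shapes follow the structure above; the random weights are placeholders, not trained values):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(a0, weights, biases):
    """Run input a0 through every layer; weights[l] has shape (n_in, n_out)."""
    activations = [a0]                  # a^(0), a^(1), ..., a^(L)
    for W, b in zip(weights, biases):
        z = W.T @ activations[-1] + b   # z^(l) = W^(l)^T a^(l-1) + b^(l)
        activations.append(sigmoid(z))  # a^(l) = sigma(z^(l))
    return activations

rng = np.random.default_rng(0)
weights = [0.01 * rng.standard_normal((784, 124)),
           0.01 * rng.standard_normal((124, 10))]
biases = [np.zeros(124), np.zeros(10)]

activations = forward(rng.standard_normal(784), weights, biases)
```

After the pass, `activations[-1]` is the 10-element output vector, one score per digit.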
Performance Index
We use the total squared error (TSE) as the performance index: $$ \text{TSE} = \sum (a^{(L)} - y)^2 $$
Output Layer Error
$$ \delta^{(L)} = a^{(L)} - y $$
Backward Propagation
Calculate the Sensitivity (S) for the Output Layer
$$ S_L = 2 A_L e_L $$
where $A_L$ is the diagonal matrix of sigmoid derivatives $\sigma'(z^{(L)})$ and $e_L = \delta^{(L)} = a^{(L)} - y$ is the output-layer error.
Backpropagate the Error through the Layers (Option 1: Heuristic Unscaled Back Propagation)
$$ e_{l-1} = W_l e_l \quad \text{and} \quad S_{l-1} = 2 A_{l-1} e_{l-1} $$
Backpropagate the Error through the Layers (Option 2: Calculus-based Back Propagation)
$$ S_{l-1} = A_{l-1} W_l S_l $$
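Assuming $A_l$ denotes the diagonal matrix of sigmoid derivatives evaluated at $z^{(l)}$, the two recursions can be sketched side by side (toy shapes and random values, for illustration only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3))    # W_l, shape (n_{l-1}, n_l)
z_prev = rng.standard_normal(5)    # z^(l-1)
S_l = rng.standard_normal(3)       # sensitivity at layer l
e_l = rng.standard_normal(3)       # error at layer l

A_prev = sigmoid_prime(z_prev)     # diagonal of A_{l-1}

# Option 1 (heuristic, unscaled): propagate the raw error first
e_prev = W @ e_l                   # e_{l-1} = W_l e_l
S_prev_opt1 = 2 * A_prev * e_prev  # S_{l-1} = 2 A_{l-1} e_{l-1}

# Option 2 (calculus-based): propagate the sensitivity directly
S_prev_opt2 = A_prev * (W @ S_l)   # S_{l-1} = A_{l-1} W_l S_l
```

Because $A_{l-1}$ is diagonal, multiplying by it reduces to an element-wise product with the vector of sigmoid derivatives.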
Gradients
For each layer $l$, going from the output layer back to the first: $$ \nabla W^{(l)} = a^{(l-1)} {S^{(l)}}^{T} $$
$$ \nabla b^{(l)} = S^{(l)} $$
Update Rule
Using gradient descent, the weights and biases are updated as follows: $$ W^{(l)} := W^{(l)} - \eta \nabla W^{(l)} $$
$$ b^{(l)} := b^{(l)} - \eta \nabla b^{(l)} $$
where $\eta$ is the learning rate.
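Putting the gradient and update formulas together for one layer, under the same $z = W^T a + b$ convention (the shapes and the value of eta here are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
eta = 0.1                        # learning rate (illustrative value)
a_prev = rng.standard_normal(5)  # a^(l-1)
S = rng.standard_normal(3)       # S^(l), sensitivity at layer l
W = rng.standard_normal((5, 3))  # W^(l)
b = np.zeros(3)                  # b^(l)

grad_W = np.outer(a_prev, S)     # grad W^(l) = a^(l-1) S^(l)^T
grad_b = S                       # grad b^(l) = S^(l)

W = W - eta * grad_W             # W^(l) := W^(l) - eta * grad W^(l)
b = b - eta * grad_b             # b^(l) := b^(l) - eta * grad b^(l)
```

The outer product gives `grad_W` the same (n_in, n_out) shape as `W`, so the update is a plain element-wise subtraction.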
TPR & FPR
True Positive Rate (TPR) and False Positive Rate (FPR) are critical metrics in evaluating the performance of a classifier, particularly in binary classification tasks.
 TPR measures the proportion of actual positives correctly identified by the classifier. A high TPR indicates that the classifier is effective in identifying positive instances.
 FPR measures the proportion of actual negatives incorrectly classified as positives. A low FPR indicates that the classifier makes fewer false positive errors.
These metrics are plotted against each other to form the Receiver Operating Characteristic (ROC) curve, which visualizes the trade-off between sensitivity and specificity (1 - FPR) across different thresholds.
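For a single threshold t, TPR and FPR follow directly from the confusion-matrix counts. A small sketch (`tpr_fpr` is a hypothetical helper; the scores and labels are made-up toy data):

```python
import numpy as np

def tpr_fpr(scores, labels, t):
    """TPR = TP/(TP+FN) and FPR = FP/(FP+TN) at threshold t."""
    pred = scores >= t                  # predicted positive iff score >= t
    tp = np.sum(pred & (labels == 1))   # true positives
    fn = np.sum(~pred & (labels == 1))  # false negatives
    fp = np.sum(pred & (labels == 0))   # false positives
    tn = np.sum(~pred & (labels == 0))  # true negatives
    return tp / (tp + fn), fp / (fp + tn)

scores = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])
labels = np.array([1,   1,   1,   0,   0,   0])
tpr, fpr = tpr_fpr(scores, labels, 0.5)  # TPR = 2/3, FPR = 1/3
```

Sweeping t over a range of values and collecting the (FPR, TPR) pairs traces out the ROC curve.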
Heuristic Unscaled Back Propagation
| t | TPR | FPR |
| --- | --- | --- |
| 0.25 | 98% | 7% |
| 0.50 | 94% | 6% |
| 0.75 | 93% | 2% |
Calculus-based Back Propagation
| t | TPR | FPR |
| --- | --- | --- |
| 0.25 | 98% | 11% |
| 0.50 | 92% | 8% |
| 0.75 | 89% | 2% |
ROC
where $t \in \{0.1, 0.2, \dots, 0.9\}$
AUC
The Area Under the Curve (AUC) of the ROC curve is a single scalar value that summarizes the performance of the classifier across all possible thresholds. The AUC provides several advantages:
 Interpretability: AUC values range from 0 to 1, where 1 indicates a perfect classifier and 0.5 suggests performance no better than random guessing. An AUC closer to 1 indicates a better-performing model.
 Threshold Independence: Unlike TPR and FPR, which depend on a specific threshold, the AUC evaluates the classifier’s performance across all thresholds, providing a more comprehensive measure.
Interpreting AUC Values
 AUC = 1 indicates a perfect classifier.
 AUC = 0.5 suggests a classifier with no discriminative power, equivalent to random guessing.
 AUC < 0.5 indicates a classifier performing worse than random guessing, which is rarely the case in practical scenarios.
The higher the AUC, the better the classifier’s overall performance.
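The AUC is the area under the piecewise-linear ROC curve, which the trapezoidal rule computes directly; in MATLAB the equivalent call is trapz(fpr, tpr). A sketch with made-up (FPR, TPR) points:

```python
import numpy as np

# (FPR, TPR) pairs sorted by ascending FPR, including the (0,0) and (1,1) corners
fpr = np.array([0.0, 0.1, 0.3, 0.6, 1.0])
tpr = np.array([0.0, 0.5, 0.8, 0.9, 1.0])

# Trapezoidal rule: sum of trapezoid areas between consecutive FPR points
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)  # 0.79 for these points
```

Note that if the ROC points are not extended to the corners (0, 0) and (1, 1), the trapezoidal sum covers only part of the FPR axis, and the resulting "AUC" can be far below 0.5 even for a good classifier.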
Results
We used MATLAB's trapz() function (trapezoidal rule) to calculate the AUC; the results are shown as follows:
Three-Layer Neural Network with Heuristic Unscaled Back Propagation: AUC = 0.1227
Three-Layer Neural Network with Calculus-based Back Propagation: AUC = 0.1491
Five-Layer Neural Network with Heuristic Unscaled Back Propagation: AUC = 0.0379
Five-Layer Neural Network with Calculus-based Back Propagation: AUC = 0.1122
Conclusions
The given results suggest that the calculus-based back-propagation algorithm generally performs better than the heuristic unscaled one for both the three-layer and five-layer neural networks, as indicated by its higher AUC values. However, all models show AUC values far below 0.5, which is inconsistent with the strong TPR/FPR figures above; a likely cause is that the ROC curve was only sampled at $t \in \{0.1, \dots, 0.9\}$ and was not extended to the corner points (0, 0) and (1, 1), so the trapezoidal integration covers only part of the FPR axis. Fixing the ROC construction, and more broadly improving the classifier design and training, both remain as future work.