Overall Framework of Supervised Learning
Formally, the simplest version of a feed-forward neural network with $D$ layers is specified by $D$ weight matrices $\mathbf{W}^1, \cdots, \mathbf{W}^D$ and $D$ layers of neural activity vectors $\mathbf{x}^1, \cdots, \mathbf{x}^D$, with $N_l$ neurons in each layer $l$, so that $\mathbf{x}^l \in \mathbb{R}^{N_l}$ and $\mathbf{W}^l$ is an $N_l \times N_{l-1}$ matrix. The feed-forward dynamics elicited by an input $\mathbf{x}^0$ presented to the network are given by:
$$\mathbf{x}^l = \phi(\mathbf{h}^l)$$
where $\mathbf{h}^l$ is the pattern of inputs to neurons at layer $l$, for $l = 1, \cdots, D$, with bias vector $\mathbf{b}^l$:
$$\mathbf{h}^l = \mathbf{W}^l \mathbf{x}^{l-1} + \mathbf{b}^l$$
Here $\phi$ is a single-neuron scalar nonlinearity that acts componentwise to transform inputs $\mathbf{h}^l$ into activities $\mathbf{x}^l$.
Denote all $N$ neural network parameters $\{\mathbf{W}^l, \mathbf{b}^l\}_{l=1}^D$ by the $N$-dimensional parameter vector $\mathbf{w}$, and the final output of the network in response to the input $\mathbf{x}^0$ by the vector $\mathbf{y} = \mathbf{x}^D(\mathbf{x}^0, \mathbf{w})$, with the function $\mathbf{x}^D$ defined recursively as above.
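As a concrete illustration, the recursion above can be written as a short forward pass. The following is a minimal NumPy sketch, not a canonical implementation; the layer widths, the Gaussian weight initialization, and the choice of $\phi = \tanh$ are all illustrative assumptions.

```python
import numpy as np

def forward(x0, weights, biases, phi=np.tanh):
    """Iterate h^l = W^l x^{l-1} + b^l and x^l = phi(h^l) for l = 1, ..., D."""
    x = x0
    for W, b in zip(weights, biases):
        h = W @ x + b      # pre-activations h^l
        x = phi(h)         # activities x^l
    return x               # network output y = x^D(x^0, w)

# Hypothetical layer widths N_0, ..., N_D for a D = 3 network.
sizes = [4, 8, 8, 2]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))  # assumed 1/sqrt(N) scaling
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n_out) for n_out in sizes[1:]]

y_hat = forward(rng.normal(size=sizes[0]), weights, biases)
```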
A supervised learning task is specified by a joint distribution $\mathcal{P}(\mathbf{x}^0, \mathbf{y})$ over possible inputs $\mathbf{x}^0$ and outputs $\mathbf{y}$. The goal of supervised learning is to find a set of parameters $\mathbf{w}$ that minimizes the test error, i.e. the expected loss on a randomly chosen input-output pair $(\mathbf{x}^0, \mathbf{y})$ drawn from $\mathcal{P}$:
$$\mathcal{E}_\mathrm{Test} = \int_{\mathcal{X}^0 \times \mathcal{Y}} \mathcal{P}(\mathbf{x}^0, \mathbf{y})\; \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})\; \mathrm{d}\mathbf{x}^0 \mathrm{d}\mathbf{y}$$
Here the loss function $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$ penalizes the discrepancy between the correct output $\mathbf{y}$ and the network prediction $\hat{\mathbf{y}} = \mathbf{x}^D(\mathbf{x}^0, \mathbf{w})$. For example, a simple loss function is the squared loss $\mathcal{L} = \frac{1}{2}\|\mathbf{y} - \hat{\mathbf{y}}\|^2$. In real-world applications, it may not be possible to directly access, or even mathematically specify, the data distribution $\mathcal{P}$. For example, in image classification, $\mathbf{x}^0$ could denote a vector of pixel intensities, whereas $\mathbf{y}$ could denote a probability distribution over image category labels. However, one can often access a finite dataset $\mathcal{D} = \{\mathbf{x}^{0,\mu}, \mathbf{y}^\mu \}_{\mu=1}^P$ of $P$ independent, identically distributed samples drawn from $\mathcal{P}$. One can then attempt to choose parameters $\mathbf{w}$ that minimize the training error:
$$\mathcal{E}_\mathrm{Train}(\mathbf{w}, \mathcal{D}) = \frac{1}{P} \sum_{\mu=1}^P \mathcal{L}(\mathbf{y}^\mu, \hat{\mathbf{y}}^\mu)$$
The training error $\mathcal{E}_\mathrm{Train}$ is the average mismatch between correct answers $\mathbf{y}^\mu$ and network predictions $\hat{\mathbf{y}}^\mu = \mathbf{x}^D(\mathbf{x}^{0,\mu}, \mathbf{w})$ on the specific training set $\mathcal{D}$. Many approaches to supervised learning attempt to minimize this training error, potentially with an additional regularization cost on $\mathbf{w}$ to promote generalization, i.e. accurate predictions on new inputs.
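Under the squared loss given above, the training error can be evaluated directly by averaging the per-example loss over the dataset. The sketch below reuses the hypothetical `forward`, `weights`, `biases`, `sizes`, and `rng` names from the previous block and builds a random toy dataset purely for illustration.

```python
def train_error(dataset, weights, biases):
    """Training error: (1/P) * sum_mu L(y^mu, y_hat^mu) with squared loss L."""
    total = 0.0
    for x0, y in dataset:
        y_hat = forward(x0, weights, biases)      # network prediction x^D(x^0, w)
        total += 0.5 * np.sum((y - y_hat) ** 2)   # squared loss L(y, y_hat)
    return total / len(dataset)

# Hypothetical training set of P = 16 random input-output pairs.
dataset = [(rng.normal(size=sizes[0]), rng.normal(size=sizes[-1]))
           for _ in range(16)]
print(train_error(dataset, weights, biases))
```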
Stochastic Gradient Descent
Main article: Gradient descent
ToDo
Statistical Mechanics of Supervised Learning
Many methods for minimizing the training error involve descending the error landscape over the parameter vector $\mathbf{w}$ given by $\mathcal{E}_\mathrm{Train}(\mathbf{w},\mathcal{D})$ via (stochastic) gradient descent. Since $\mathcal{E}_\mathrm{Train}$ can be thought of as an energy function over thermal degrees of freedom $\mathbf{w}$, with the data $\mathcal{D}$ playing the role of quenched disorder, problems pertaining to error landscapes resemble problems in the statistical mechanics of energy landscapes with quenched disorder, including phenomena like random Gaussian landscapes, spin glasses, and jamming.
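Concretely, the descent dynamics referred to here can be sketched as the update rule below, with $\eta$ an assumed learning rate and $\mathcal{B}_t \subseteq \mathcal{D}$ the minibatch (possibly all of $\mathcal{D}$) used at step $t$:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta \, \nabla_{\mathbf{w}} \mathcal{E}_\mathrm{Train}(\mathbf{w}_t, \mathcal{B}_t)$$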
Expressivity
Main article: Expressivity
Error Landscape
Main article: Error Landscape
Signal Propagation
Generalization
See also