|
Machine learning is an umbrella term for a field of study and application within computer science to develop algorithms and statistical models, enabling computers to learn from data and make predictions or decisions without explicit programming. It encompasses a diverse set of methodologies, including supervised, unsupervised, and reinforcement learning, all aimed at discovering patterns, relationships, and insights within complex datasets.
Table of Contents
|
Deep Learning
Deep learning is part of a broader family of machine learning methods, which is based on multi-layered artificial neural networks with representation learning. While most modern deep learning models are based on multi-layered artificial neural networks such as convolutional neural networks and transformers, they can also include propositional formulas or latent variables organized layer-wise in deep generative models such as the nodes in deep belief networks and deep Boltzmann machines.
Deep neural networks are generally interpreted in terms of the universal approximation theorem and probabilistic inference. The universal approximation theorem for deep neural networks concerns the capacity of networks with bounded width but the depth is allowed to grow. Lu et al.[1] proved that if the width of a deep neural network with ReLU activation is strictly larger than the input dimension, then the network can approximate any Lebesgue integrable function; if the width is smaller or equal to the input dimension, then a deep neural network is not a universal approximator.
The adjective "deep" in deep learning refers to the use of multiple layers in the network. Artificial neural networks (ANNs) were inspired by information processing and distributed communication nodes in biological systems. ANNs have various differences from biological brains. Specifically, artificial neural networks tend to be static and symbolic, while the biological brain of most living organisms is dynamic (plastic) and analog.
Deep neural network tasks are often categorized as discriminative (recognition) or generative (imagination). Often but not always, discriminative tasks use supervised methods and generative tasks use unsupervised; however, the separation is very hazy.
In overview:
- Supervised learning is used to make future predictions based on labeled data. It is defined by its use of labeled datasets to train algorithms that to classify data or predict outcomes accurately. Supervised learning models can be a valuable solution for eliminating manual classification.
- Unsupervised learning uses machine learning algorithms to analyze and cluster unlabeled datasets. Unsupervised algorithms discover hidden patterns or data groupings without the need for human intervention.
For example, object recognition favors supervised learning but unsupervised learning can also cluster objects into groups. Some tasks employ both methods in combination, or see both supervised and unsupervised approaches as competing paradigms. For example, image recognition started off as heavily supervised, but became hybrid by employing unsupervised pre-training, and then moved towards supervision again with the advent of dropout, ReLU, and adaptive learning rates.
Overall Framework of Supervised Learning
Main article: Supervised learning
Formally, a simplest version of a feed-forward neural network with $D$ layers is specified by $D$ weight matrices $\mathbf{W}^1, \cdots, \mathbf{W}^D$ and $D$ layers of neural activity vectors $\mathbf{x}^1, \cdots, \mathbf{x}^D$, with $N_l$ neurons in each layer $l$, so that $\mathbf{x}^l \in \mathbb{R}^{N_l}$ and $\mathbf{W}^l$ is an $N_l \times N_{l-1}$ matrix. Feed-forward dynamics are elicited by an input $\mathbf{x}^0$ presented to the network is given as:
$$\mathbf{x}^l = \phi(\mathbf{h}^l)$$
Where, $\mathbf{h}^l$ is a pattern of inputs to neurons at layer $l$ for $l = 1, \cdots, D$, with vector of biases $\mathbf{b}^l$.
$$\mathbf{h}^l = \mathbf{W}^l \mathbf{x}^{l-1} + \mathbf{b}^l$$
So that $\phi$ is a single neuron scalar non-linearity that acts componentwise to transform inputs $\mathbb{h}^l$ into activity $\mathbb{x}^l$.
Denote all $N$ neural network parameters: $\{\mathbf{W}^l, \mathbf{b}^l\}^D_{l=1}$ by the $N$-dimensional parameter vector [[$\mathbf{w}$, and the final output of the network in response to the input $\mathbf{x}^0$ by the vector $\mathbf{y} = \mathbf{x}^D(\mathbf{x}^0, \mathbf{w})$, with function $\mathbf{x}^D$ defined recursively as above.
A supervised learning task is specified by a joint distribution $\mathcal{P}(\mathbf{x}^0, \mathbf{y})$ over possible inputs $\mathbf{x}^0$ and outputs $\mathbf{y}$. A key goal of a supervised learning task is to find an optimal set of parameters that minimizes the test error on a randomly chosen input-output pair $(\mathbf{x}^0, \mathbf{y})$:
$$\mathcal{E}_\mathrm{Test} = \int_{\mathcal{X}^0 \times \mathcal{Y}} \mathcal{P}(\mathbf{x}^0, \mathbf{y})\; \mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})\; \mathrm{d}\mathbf{x}^0 \mathrm{d}\mathbf{y}$$
Where the loss function $\mathcal{L}(\mathbf{y}, \hat{\mathbf{y}})$ penalizes discrepancy between the correct output $\mathcal{y}$ and network prediction $\hat{\mathcal{y}} = \mathbf{x}^D(\mathbf{x}^0, \mathbf{w})$. For example, a simple loss function is the squared loss $\mathcal{L} = \frac{1}{2}(\mathbf{y} - \hat{\mathbf{y}})^2$. In real world applications, it may not be possible to either directly access or even mathematically specify the data distribution $\mathcal{P}$. For example, in image classification, $\mathbf{x}^0$ could denote a vector of pixel intensities, whereas $\mathbf{y}$ could denote a probability distribution over image category labels. However, one can often access a finite dataset: $\mathcal{D} = \{\mathbf{x}^{0,\mu}, \mathbf{y}^\mu \}_{\mu=1}^P$ of $P$ independent identically distributed samples drawn from $\mathcal{P}$. One can then attempt to choose parameters $\mathbf{w}$ to minimize the training error:
$$\mathcal{E}_\mathrm{Train}(\mathbf{w}, \mathcal{D}) = \frac{1}{P} \sum_{\mu=1}^P \mathcal{L}(\mathbf{y}^\mu, \hat{\mathbf{y}}^\mu)$$
The training error $\mathcal{E}_\mathrm{Train}$ corresponds to the average mismatch between correct answers $\mathbf{y}^\mu$ and network predictions $\hat{\mathbf{y}}^\mu = \mathbf{x}^D(\mathbf{x}^{0,\mu}, \mathbf{w})$ on the specific training set $\mathcal{D}$. Many approaches to supervised learning attempt to minimize this training error, potentially with an additional cost function on $\mathbf{w}$ to promote generalization to accurate predictions on new inputs.
Statistical Mechanics of Supervised Learning
Many methods for minimizing the training error involve descending the error landscape over the parameter vector $\mathbf{w}$ given by $\mathcal{E}_\mathrm{Train}(\mathbf{w},\mathcal{D})$ via (stochastic) gradient descent. Since $\mathcal{E}_\mathrm{Train}$ can be thought of as an energy function over thermal degrees of freedom $\mathbf{w}$, where the data $\mathcal{D}$ plays the role of quenched disorder. This makes the problems pertaining to error landscapes similar to problems in statistical mechanics of energy landscapes with quenched disorder, including phenomena like random Gaussian landscapes, spin glasses, and jamming.
Overall Framework of Unsupervised Learning
In addition to learning input-output tuple maps, another key branch of machine learning, known as unsupervised learning, concerns modeling and understanding the structure of complex data. For example, how can we describe the structure of natural images, sounds, and language? To accurately model probability distributions over such complex data allows for generation of naturalistic data. Of course distribution over such complex data as images or sounds cannot be mathematically specified, but we often have access to an empirical distribution of $P$ samples:
$$q(\mathbf{x}) = \frac{1}{P}\sum_{\mu=1}^P \delta(\mathbf{x}-\mathbf{x}^\mu)$$
Here for example, each $\mathbf{x}^\mu$ could denote a vector of pixel intensities for images, or a time series of pressure variations for a sound. The goal of unsupervised learning is to adjust parameters of $\mathbf{w}$ of a family of distributions $p(\mathbf{x}; \mathbf{w})$ to find one similar to the data distribution $q(\mathbf{x})$. This is often done by maximizing the log likelihood of the data with respect to model parameters $\mathbf{w}$:
$$l(\mathbf{w}) = \int_\mathcal{X} q(x) \log p(\mathbf{x}; \mathbf{w})\; \mathrm{d}\mathbf{x}$$
This learning principle modifies $p$ to assign high probability to data points, and consequently low probability elsewhere, thereby moving the model distribution $p(\mathbb{x}; \mathbb{w})$ closer to the data distribution $q(\mathbf{x})$.
See also: Entropy, free energy
Role of Latents in Unsupervised Learning
See also: Information theory, latent space, autoencoder
Once a good model $p(\mathbf{x}, \mathbf{w}$ is found, it has many uses. For example one can sample from it to imagine new data. One can also use it to denoise or fill in missing entries in the given data vector $\mathbf{x}$. Furthermore, if the distribution $p$ consists of a generative process that transforms the latent, (hidden variables $\mathbf{h}$), into the visible data vector $\mathbf{x}$, then the inferred latent variables $\mathbf{h}$ rather than $\mathbb{x}$ itself can aid in solving subsequent supervised learning tasks. This approach has been very successful, for example, in natural language processing, where the hidden layers of a network trained simply to generate language form useful internal representations for solving subsequent language processing tasks. Thus latent space of unsupervised learning model reflects features of a task-relevant structure in the dataset.
Statistical Mechanics of Unsupervised Learning
See also: Statistical mechanics, coupled map, Boltzmann machine
Interestingly, the process of choosing $p$ can be thought of as an inverse statistical mechanics problem. While traditionally, many problems of the theory of equilibrium statistical mechanics involve starting from a Boltzmann distribution $p(\mathbf{x}; \mathbf{w})$ over microstates $\mathbf{x}$, with couplings $\mathbf{w}$, and computing bulk statistics of $\mathbf{x}$ from $p$. In contrast, machine learning involves sampling from microstates $\mathbf{x}$ and deducing an appropriate distribution $p(\mathbf{x}; \mathbf{w})$.
Foundational Theoretical Questions in Deep Learning
Supervised Learning
- What is advantage of the depth $D$? In principle, what functions can be computed only for large $D$? This is similar to the problems of dynamical phase transitions between order and chaos.
- What is the shape of the error landscape? Many methods of minimizing $\mathcal{E}_\mathrm{Train}(\mathbf{w},\mathcal{D})$ involve descending the error landscape over parameter vector $\mathbf{w}$ via (stochastic) gradient descent.
- When can we descend to points of low training error? It is usually assumed that the error landscape will have a pointed hierarchical structure. This is not always the case. As discussed above, this is similar to problems of statistical mechanics with quenched disorder.
- How do we choose a good initialization point? When minimizing $\mathcal{E}_\mathrm{Train}(\mathbf{w},\mathcal{D})$ via gradient descent one must start at some initial $\mathbf{w}$, often initialized randomly. Can we choose the random initialization that would accelerate the subsequent gradient descent? This pertains to theories of signal propagation through random deep networks provide clues to good initializations, making connections to topics in random matrix theory, free probability, and functional path integrals.
- How to generalize from a dataset? Though many learning algorithms minimize $\mathcal{E}_\mathrm{Train}$, possibly with extra regularization on parameters $\mathbf{w}$, the key goal is to minimize the inaccessible test error $\mathcal{E}_\mathrm{Test}$ on a randomly chosen new input not necessarily present in the training data $\mathcal{D}$. It is then critical to achieve small generalization error $\mathcal{E}_\mathrm{Gen} = \mathcal{E}_\mathrm{Test} - \mathcal{E}_\mathrm{Train}$. When can we achieve such a small generalization error, especially in situations in which number of parameters $N$ can exceed the number of data points $P$. This is connected to topics in percolation theory, random matrix spectra, free field theories, interacting particle systems such as coupled maps.
Unsupervised Learning
On the unsupervised side, the theoretical development is less mature. However, work in deep unsupervised learning connects to ideas in equilibrium statistical mechanics, like free-energy minimization, as well as non-equilibrium statistical mechanics, like Jarzynski equality and heat dissipation in irreversible processes.
|