Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention.
During the learning phase, an unsupervised network tries to mimic the data it's given and uses the error in its mimicked output to correct itself (i.e. correct its weights and biases). Sometimes the error is expressed as a low probability that the erroneous output occurs, or it might be expressed as an unstable high energy state in the network.
In contrast to supervised learning's dominant use of backpropagation, unsupervised learning employs a variety of other methods, including the Hopfield learning rule, the Boltzmann learning rule, Contrastive Divergence, Wake-Sleep, Variational Inference, Maximum Likelihood, Maximum A Posteriori, Gibbs Sampling, and backpropagating reconstruction errors or hidden-state reparameterizations. See the table below for more details.
Overall Framework of Unsupervised Learning
Unsupervised learning concerns modeling and understanding the structure of complex data. For example, how can we describe the structure of natural images, sounds, and language? Accurately modeling probability distributions over such complex data allows for the generation of naturalistic data. Of course, the distribution over complex data such as images or sounds cannot be specified mathematically from first principles, but we often have access to an empirical distribution of $P$ samples:
$$q(\mathbf{x}) = \frac{1}{P}\sum_{\mu=1}^P \delta(\mathbf{x} - \mathbf{x}^\mu)$$
Here, for example, each $\mathbf{x}^\mu$ could denote a vector of pixel intensities for an image, or a time series of pressure variations for a sound. The goal of unsupervised learning is to adjust the parameters $\mathbf{w}$ of a family of distributions $p(\mathbf{x}; \mathbf{w})$ to find one similar to the data distribution $q(\mathbf{x})$. This is often done by maximizing the log likelihood of the data with respect to the model parameters $\mathbf{w}$:
$$l(\mathbf{w}) = \int_\mathcal{X} q(\mathbf{x}) \log p(\mathbf{x}; \mathbf{w})\; \mathrm{d}\mathbf{x}$$
This learning principle modifies $p$ to assign high probability to data points, and consequently low probability elsewhere, thereby moving the model distribution $p(\mathbf{x}; \mathbf{w})$ closer to the data distribution $q(\mathbf{x})$.
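To make this concrete, note that substituting the empirical distribution $q(\mathbf{x})$ above into the log likelihood integral collapses it to a sample average, $l(\mathbf{w}) = \frac{1}{P}\sum_\mu \log p(\mathbf{x}^\mu; \mathbf{w})$, which can be maximized by gradient ascent. The following sketch (an illustrative assumption, not from the text) fits a one-dimensional Gaussian model $p(x; \mathbf{w})$ with parameters $\mathbf{w} = (\mu, \log\sigma)$ to samples this way:

```python
import numpy as np

# With the empirical q(x) = (1/P) * sum_mu delta(x - x^mu), the log
# likelihood l(w) = ∫ q(x) log p(x; w) dx becomes a sample average.
# Illustrative model choice: p(x; w) is a 1-d Gaussian, w = (mu, log_sigma).

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=0.5, size=1000)  # the samples x^mu

# Gradient ascent on l(w); gradients follow from
# log p = -log_sigma - (x - mu)^2 / (2 sigma^2) + const.
mu, log_sigma = 0.0, 0.0
lr = 0.1
for _ in range(500):
    sigma = np.exp(log_sigma)
    grad_mu = np.mean((data - mu) / sigma**2)
    grad_ls = np.mean(((data - mu) / sigma) ** 2 - 1.0)
    mu += lr * grad_mu
    log_sigma += lr * grad_ls

# Maximum likelihood recovers the sample mean and standard deviation
print(mu, np.exp(log_sigma))
```

Here the model family is rich enough to match $q$'s first two moments exactly; for complex data such as images, $p(\mathbf{x}; \mathbf{w})$ would instead be parameterized by a neural network, but the same learning principle applies.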
See also: Entropy, free energy
Role of Latents in Unsupervised Learning
See also: Information theory, latent space, autoencoder
Once a good model $p(\mathbf{x}; \mathbf{w})$ is found, it has many uses. For example, one can sample from it to imagine new data. One can also use it to denoise, or to fill in missing entries in, a given data vector $\mathbf{x}$. Furthermore, if the distribution $p$ consists of a generative process that transforms latent (hidden) variables $\mathbf{h}$ into the visible data vector $\mathbf{x}$, then the inferred latent variables $\mathbf{h}$, rather than $\mathbf{x}$ itself, can aid in solving subsequent supervised learning tasks. This approach has been very successful, for example, in natural language processing, where the hidden layers of a network trained simply to generate language form useful internal representations for solving subsequent language processing tasks. Thus the latent space of an unsupervised learning model reflects task-relevant structure in the dataset.
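A minimal sketch of latent-variable inference (an illustrative linear example, not a method named in the text): data $\mathbf{x}$ is generated by a linear map from a low-dimensional latent $\mathbf{h}$ plus noise, and PCA recovers a latent space that captures nearly all of the structure in $\mathbf{x}$.

```python
import numpy as np

# Illustrative generative process: x = W h + noise, with a 2-d latent
# h embedded in a 10-d observation space. PCA (a linear latent model)
# infers latent codes that summarize x.

rng = np.random.default_rng(1)
P, d, k = 500, 10, 2
W_true = rng.normal(size=(d, k))           # generative map h -> x
h_true = rng.normal(size=(P, k))           # true latent variables
X = h_true @ W_true.T + 0.1 * rng.normal(size=(P, d))

# Infer latents: project onto the top-k principal directions of X
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
h_inferred = Xc @ Vt[:k].T                 # inferred latent codes

# Reconstruction from the 2-d latent space captures most of the variance,
# so h_inferred is a compact, task-relevant summary of x.
X_recon = h_inferred @ Vt[:k] + X.mean(axis=0)
explained = 1 - np.sum((X - X_recon) ** 2) / np.sum(Xc ** 2)
print(explained)
```

The same idea underlies nonlinear latent-variable models such as autoencoders: the inferred $\mathbf{h}$ is lower-dimensional than $\mathbf{x}$ yet retains the structure needed for downstream tasks.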
Statistical Mechanics of Unsupervised Learning
See also: Statistical mechanics, coupled map, Boltzmann machine
Interestingly, the process of choosing $p$ can be thought of as an inverse statistical mechanics problem. Traditionally, many problems in the theory of equilibrium statistical mechanics involve starting from a Boltzmann distribution $p(\mathbf{x}; \mathbf{w})$ over microstates $\mathbf{x}$, with couplings $\mathbf{w}$, and computing bulk statistics of $\mathbf{x}$ from $p$. Machine learning inverts this: it starts from samples of microstates $\mathbf{x}$ and deduces an appropriate distribution $p(\mathbf{x}; \mathbf{w})$.
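A toy version of this inverse problem (an illustrative assumption, not a model from the text): a two-spin Boltzmann distribution $p(s_1, s_2; w) \propto e^{w s_1 s_2}$ with spins in $\{-1, +1\}$. The forward problem computes the correlation $\langle s_1 s_2 \rangle = \tanh(w)$ from the coupling $w$; the inverse (learning) problem recovers $w$ from samples by matching this moment, which for this model coincides with maximum likelihood.

```python
import numpy as np

# Two-spin Boltzmann distribution p(s1, s2; w) ∝ exp(w * s1 * s2).
# Forward stat mech: coupling w -> correlation <s1 s2> = tanh(w).
# Inverse (learning): samples -> coupling, via moment matching.

rng = np.random.default_rng(2)
w_true = 0.8

# Sample exactly by enumerating the 4 microstates
states = np.array([(s1, s2) for s1 in (-1, 1) for s2 in (-1, 1)])
energies = -w_true * states[:, 0] * states[:, 1]
probs = np.exp(-energies)
probs /= probs.sum()
idx = rng.choice(4, size=50_000, p=probs)
samples = states[idx]

# Maximum likelihood: set the model correlation tanh(w) equal to the
# empirical correlation, then invert.
corr = np.mean(samples[:, 0] * samples[:, 1])
w_hat = np.arctanh(corr)
print(w_hat)  # close to w_true = 0.8
```

For larger spin systems the same inversion is intractable in closed form, which is why methods such as Contrastive Divergence and Gibbs Sampling, listed above, are used to train Boltzmann machines.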
