Energy-based model

An energy-based model (EBM), sometimes referred to as an energy-based probabilistic model (EBPM), is a form of generative machine learning model closely related to physics and used as an unsupervised learning method in natural language processing and computer vision.

# Motivation

An EBM learns the characteristics of a target dataset and generates a similar but larger dataset. EBMs detect the latent variables of a dataset and generate new datasets with a similar distribution. EBMs are significant to information geometry insofar as they are unnormalized: they do not require that energies be normalized as probabilities (energies do not need to sum to 1). Because there is no need to estimate the normalization constant, as explicitly normalized probabilistic models must, certain forms of inference and learning with EBMs are more tractable and flexible.

Traditional EBMs rely on stochastic gradient descent (SGD) optimization methods that are typically hard to apply to high-dimensional datasets. In 2019, OpenAI publicized a variant that instead used Langevin dynamics, an iterative algorithm that adds noise to each gradient step as part of learning an objective function. It can also be used in Bayesian learning scenarios, where it produces samples from a posterior distribution.
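The noisy gradient step described above can be illustrated with a minimal sketch of unadjusted Langevin dynamics. This is not the 2019 implementation. It assumes a hypothetical one-dimensional toy energy $E(x) = x^2/2$, for which $e^{-E}$ is proportional to a standard Gaussian, and the names `energy_grad` and `langevin_samples`, the step size, and the burn-in length are all illustrative choices.

```python
import math
import random


def energy_grad(x):
    # Gradient of the assumed toy energy E(x) = x**2 / 2.
    return x


def langevin_samples(n_steps, step_size=0.1, burn_in=1000, seed=0):
    """Run one chain of unadjusted Langevin dynamics and return its samples."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for t in range(n_steps + burn_in):
        noise = rng.gauss(0.0, 1.0)
        # Half a gradient step downhill in energy, plus Gaussian noise
        # whose standard deviation is sqrt(step_size).
        x = x - 0.5 * step_size * energy_grad(x) + math.sqrt(step_size) * noise
        if t >= burn_in:
            samples.append(x)
    return samples


samples = langevin_samples(20000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)
```

With a small step size, the sample mean and variance should come out close to 0 and 1, the moments of the target distribution. The finite step size introduces a small bias, which a Metropolis correction step would remove.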

Energy-based models tie deep learning to the conceptual picture of statistical mechanics. They express differences between Gibbs and Helmholtz free energies as exactly the Kullback–Leibler divergence of information theory, thus bridging the Boltzmann entropy in thermodynamics and Rényi entropies (such as Shannon entropy) in information theory.

# Overview

An energy-based model is specified as a Boltzmann distribution (in units where $kT = 1$, with $k$ the Boltzmann constant and $T$ the temperature):

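$$p(\mathbf{x}; \mathbf{w}) = \frac{1}{Z_\mathbf{w}}e^{-E(\mathbf{x}; \mathbf{w})}$$

Here $Z_\mathbf{w} = \int e^{-E(\mathbf{x}; \mathbf{w})} \mathrm{d}\mathbf{x}$ is the normalization constant (the partition function).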
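On a finite state space the integral defining the normalization constant becomes a sum, and the distribution can be computed directly. The following is a minimal sketch under the standard convention $p \propto e^{-E}$ with $kT = 1$. The helper name `boltzmann` and the three energy values are hypothetical.

```python
import math


def boltzmann(energies):
    """Return (probabilities, Z) for p_i = exp(-E_i) / Z on a finite state space."""
    # Subtract the minimum energy before exponentiating for numerical
    # stability; the shift cancels in the probabilities.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min)) for e in energies]
    shifted_z = sum(weights)
    probs = [w / shifted_z for w in weights]
    z = shifted_z * math.exp(-e_min)  # undo the shift to recover Z itself
    return probs, z


energies = [0.0, 1.0, 2.0]  # hypothetical energies of three states
probs, z = boltzmann(energies)
print(probs, z)
```

Lower-energy states receive higher probability, and each unit of energy difference changes the probability ratio by a factor of $e$.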

The earliest energy-based probabilistic models in machine learning were in fact called Boltzmann machines, and they map directly onto Ising spin models with a learned coupling structure $\mathbf{w}$. Inserting the Boltzmann form into the log-likelihood learning objective $l(\mathbf{w}) = \int q(\mathbf{x}) \log p(\mathbf{x}; \mathbf{w}) \mathrm{d}\mathbf{x}$ yields:

$$-l(\mathbf{w}) = \langle E(\mathbf{x}; \mathbf{w})\rangle_q - F_\mathbf{w}$$

where $\langle \cdot \rangle_q$ denotes an average with respect to the data distribution $q(\mathbf{x})$ and $F_\mathbf{w} = -\log Z_\mathbf{w}$ is the Helmholtz free energy of the model distribution $p(\mathbf{x}; \mathbf{w})$. Thus, learning by maximizing the log-likelihood corresponds to lowering the average energy of the observed data while raising the overall free energy of the model distribution. Maximizing $l(\mathbf{w})$ is also equivalent to minimizing the Kullback–Leibler divergence,

$$D_\mathrm{KL}(q\| p) = \int q(\mathbf{x}) \log \left( \frac{q(\mathbf{x})}{p(\mathbf{x}; \mathbf{w})}\right)\mathrm{d}\mathbf{x} = G_\mathbf{w}(q) - F_\mathbf{w}$$

The Kullback–Leibler divergence $D_\mathrm{KL}(q \| p)$ is a nonnegative measure of the divergence between two distributions $q$ and $p$ that is zero if and only if $q=p$. In the special case when $p$ takes the Boltzmann form, the KL divergence becomes the difference between the Gibbs free energy of $q$, defined as $G_\mathbf{w}(q) = \langle E(\mathbf{x}; \mathbf{w})\rangle_q - S(q)$ (where $S(q) = -\int q(\mathbf{x})\log q(\mathbf{x}) \mathrm{d}\mathbf{x}$ is the entropy of $q$), and the Helmholtz free energy $F_\mathbf{w}$ of $p$.
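Both identities can be checked numerically on a small discrete state space, where integrals become sums. The sketch below assumes the standard convention $p \propto e^{-E}$. The energies and the data distribution `q` are hypothetical numbers chosen only for illustration.

```python
import math

energies = [0.5, 1.5, 0.0, 2.0]  # hypothetical E(x; w) on four states
q = [0.4, 0.1, 0.3, 0.2]         # hypothetical data distribution

z = sum(math.exp(-e) for e in energies)
p = [math.exp(-e) / z for e in energies]

free_energy = -math.log(z)                               # Helmholtz F_w
mean_energy = sum(qi * e for qi, e in zip(q, energies))  # <E>_q
entropy = -sum(qi * math.log(qi) for qi in q)            # S(q)
gibbs = mean_energy - entropy                            # Gibbs G_w(q)

# Direct evaluation of the KL divergence and the negative log-likelihood.
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))
neg_log_lik = -sum(qi * math.log(pi) for qi, pi in zip(q, p))

print(kl, gibbs - free_energy)                  # the two values should agree
print(neg_log_lik, mean_energy - free_energy)   # the two values should agree
```

The two quantities differ only by the entropy $S(q)$, which does not depend on $\mathbf{w}$. This is why maximizing the log-likelihood and minimizing the KL divergence select the same parameters.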

Learning then corresponds to fixing the data distribution $q$ and optimizing the model parameters $\mathbf{w}$ in $D_\mathrm{KL}(q \| p) = G_\mathbf{w}(q) - F_\mathbf{w}$, as in the expression above. However, this decomposition has another widespread application in both machine learning and statistical mechanics. Often we are given a fixed Boltzmann distribution with coupling parameters $\mathbf{w}$ that we would like to approximate with a simpler variational distribution $q(\mathbf{x})$. Such an approximation can be derived by fixing $\mathbf{w}$ and minimizing with respect to $q$ either the KL divergence $D_\mathrm{KL}(q \| p)$ or, equivalently, the Gibbs free energy $G_\mathbf{w}(q)$; the two differ only by $F_\mathbf{w}$, which does not depend on $q$. This approach leads to both variational inference in machine learning and variational mean-field methods in equilibrium statistical mechanics.
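As an illustration of the variational use, the following sketch applies a mean-field approximation to a hypothetical two-spin Ising model with energy $E(s_1, s_2) = -J s_1 s_2 - h(s_1 + s_2)$, using the convention $p \propto e^{-E}$. The values of `J` and `H`, the factorized form $q(s_1, s_2) = q_1(s_1) q_2(s_2)$ parameterized by magnetizations `m1` and `m2`, and the fixed-point iteration are assumptions made for this example, not a general-purpose method.

```python
import math
from itertools import product

J, H = 0.5, 0.3  # hypothetical coupling and external field


def energy(s1, s2):
    return -J * s1 * s2 - H * (s1 + s2)


states = list(product([-1, 1], repeat=2))
z = sum(math.exp(-energy(*s)) for s in states)
free_energy = -math.log(z)                            # exact Helmholtz F_w
p = {s: math.exp(-energy(*s)) / z for s in states}    # exact Boltzmann p


def q_prob(m1, m2, s):
    # Factorized distribution with mean spins (magnetizations) m1 and m2.
    return 0.25 * (1 + m1 * s[0]) * (1 + m2 * s[1])


def gibbs(m1, m2):
    # G_w(q) = <E>_q - S(q) = sum_s q(s) * (E(s) + log q(s)).
    total = 0.0
    for s in states:
        qs = q_prob(m1, m2, s)
        if qs > 0.0:
            total += qs * (energy(*s) + math.log(qs))
    return total


# Coordinate-wise minimization of G: each update is the mean-field
# self-consistency equation m_i = tanh(h + J * m_j).
m1, m2 = 0.0, 0.0
for _ in range(200):
    m1 = math.tanh(H + J * m2)
    m2 = math.tanh(H + J * m1)

g_mf = gibbs(m1, m2)
kl = sum(q_prob(m1, m2, s) * math.log(q_prob(m1, m2, s) / p[s]) for s in states)
print(m1, m2, g_mf, free_energy, kl)
```

The minimized Gibbs free energy stays above the exact Helmholtz free energy, and the gap between them equals the remaining KL divergence. The factorized $q$ cannot capture the correlation between the two spins, so the gap does not close.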