# Energy-based model

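**Energy-based probabilistic models** borrow their form from statistical physics: the model distribution over $\mathbf{x}$ is specified as a Boltzmann distribution (with the thermal energy set to $kT = 1$) in terms of an energy function $E(\mathbf{x}; \mathbf{w})$ with parameters $\mathbf{w}$: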

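$$p(\mathbf{x}; \mathbf{w}) = \frac{1}{Z_\mathbf{w}}e^{-E(\mathbf{x}; \mathbf{w})}$$

Here $Z_\mathbf{w} = \int e^{-E(\mathbf{x}; \mathbf{w})}\,\mathrm{d}\mathbf{x}$ (a sum, for discrete $\mathbf{x}$) is the partition function that normalizes the distribution.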
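
As a minimal sketch of this definition, the following Python snippet enumerates the states of a two-spin system and normalizes $e^{-E}$ by the partition function. The Ising-style pairwise energy, the coupling value, and the function names `energy` and `boltzmann` are illustrative assumptions, not part of any particular library. Because the states are discrete, the integral over $\mathbf{x}$ becomes a finite sum.

```python
import itertools
import math

def energy(x, w):
    """Hypothetical Ising-style energy: E(x; w) = -sum_{i<j} w[i][j] * x[i] * x[j]."""
    n = len(x)
    return -sum(w[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

def boltzmann(w, n):
    """Return all spin states, their probabilities exp(-E) / Z, and Z itself."""
    states = list(itertools.product([-1, 1], repeat=n))
    weights = [math.exp(-energy(x, w)) for x in states]
    Z = sum(weights)  # partition function: sum over all 2**n states
    return states, [wt / Z for wt in weights], Z

# Two spins with a single positive coupling (an assumed toy value).
w = [[0.0, 1.0],
     [0.0, 0.0]]
states, probs, Z = boltzmann(w, n=2)
for x, p in zip(states, probs):
    print(x, round(p, 4))
```

With a positive coupling, the two aligned configurations have lower energy and therefore receive higher probability than the two anti-aligned ones.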

The earliest energy-based probabilistic models in machine learning were in fact called Boltzmann machines, and they map directly onto Ising spin models with a learned coupling structure $\mathbf{w}$. Inserting the Boltzmann form, for which $\log p(\mathbf{x}; \mathbf{w}) = -E(\mathbf{x}; \mathbf{w}) - \log Z_\mathbf{w}$, into the log-likelihood learning objective $l(\mathbf{w}) = \int q(\mathbf{x}) \log p(\mathbf{x}; \mathbf{w}) \mathrm{d}\mathbf{x}$ yields:

$$-l(\mathbf{w}) = \langle E(\mathbf{x}; \mathbf{w})\rangle_q - F_\mathbf{w}$$

where $\langle \cdot \rangle_q$ denotes an average with respect to the data distribution $q(\mathbf{x})$ and $F_\mathbf{w} = -\log Z_\mathbf{w}$ is the Helmholtz free energy of the model distribution $p(\mathbf{x}; \mathbf{w})$. Thus, learning by maximizing the log-likelihood corresponds to minimizing the average energy of the observed data while increasing the overall free energy of the model distribution. Maximizing $l(\mathbf{w})$ is also equivalent to minimizing the Kullback-Leibler divergence,

$$D_\mathrm{KL}(q\| p) = \int q(\mathbf{x}) \log \left( \frac{q(\mathbf{x})}{p(\mathbf{x}; \mathbf{w})}\right)\mathrm{d}\mathbf{x} = G_\mathbf{w}(q) - F_\mathbf{w}$$

The Kullback-Leibler divergence $D_\mathrm{KL}(q \| p)$ is a nonnegative measure of the dissimilarity between two distributions $q$ and $p$ that is zero if and only if $q=p$. In the special case where $p$ takes the Boltzmann form, the KL divergence becomes the difference between the Gibbs free energy of $q$, defined as $G_\mathbf{w}(q) = \langle E(\mathbf{x}; \mathbf{w})\rangle_q - S(q)$, where $S(q) = -\int q(\mathbf{x})\log q(\mathbf{x}) \mathrm{d}\mathbf{x}$ is the entropy of $q$, and the Helmholtz free energy $F_\mathbf{w}$ of $p$. Since the entropy $S(q)$ does not depend on $\mathbf{w}$, it follows that $D_\mathrm{KL}(q\| p) = -l(\mathbf{w}) - S(q)$, so minimizing the divergence over $\mathbf{w}$ is the same as maximizing $l(\mathbf{w})$.
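
The two free-energy identities can be checked numerically on a toy example. The sketch below assumes a discrete state space with three states, so integrals become sums. The energies `E` and the data distribution `q` are made-up values chosen only for illustration.

```python
import math

# Toy setup (all values are illustrative assumptions): three discrete states.
E = [0.5, 1.0, 2.0]   # model energies E(x; w)
q = [0.6, 0.3, 0.1]   # data distribution q(x)

Z = sum(math.exp(-e) for e in E)       # partition function Z_w
F = -math.log(Z)                       # Helmholtz free energy F_w
p = [math.exp(-e) / Z for e in E]      # model distribution p(x; w)

avg_E = sum(qi * e for qi, e in zip(q, E))     # <E>_q
S = -sum(qi * math.log(qi) for qi in q)        # entropy S(q)
G = avg_E - S                                  # Gibbs free energy G_w(q)

neg_loglik = -sum(qi * math.log(pi) for qi, pi in zip(q, p))  # -l(w)
kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))      # D_KL(q || p)

print(neg_loglik, avg_E - F)  # the two numbers should match
print(kl, G - F)              # the two numbers should match
```

Both printed pairs should agree up to floating-point rounding for any choice of `E`, provided `q` is strictly positive and sums to one.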