Energy-Based Probabilistic Models

Energy-based probabilistic models are the machine learning models most closely related to physics, specified as a Boltzmann distribution (with temperature $k_B T = 1$):

$$p(\mathbf{x}; \mathbf{w}) = \frac{1}{Z_\mathbf{w}}e^{-E(\mathbf{x}; \mathbf{w})}$$

The earliest energy-based probabilistic models in machine learning were in fact called Boltzmann machines, and map directly onto Ising spin models with a learned coupling structure $\mathbf{w}$. Inserting the Boltzmann form into the log-likelihood learning objective $l(\mathbf{w}) = \int q(\mathbf{x}) \log p(\mathbf{x}; \mathbf{w}) \mathrm{d}\mathbf{x}$ yields:

$$-l(\mathbf{w}) = \langle E(\mathbf{x}; \mathbf{w})\rangle_q - F_\mathbf{w}$$

where $\langle \cdot \rangle_q$ denotes an average with respect to the data distribution $q(\mathbf{x})$ and $F_\mathbf{w} = -\log Z_\mathbf{w}$ is the Helmholtz free energy of the model distribution $p(\mathbf{x}; \mathbf{w})$. Thus, learning via maximizing the log-likelihood corresponds to minimizing the energy of observed data while increasing the overall free energy of the model distribution. Maximizing $l(\mathbf{w})$ is also equivalent to minimizing the Kullback-Leibler divergence,

$$D_\mathrm{KL}(q\| p) = \int q(\mathbf{x}) \log \left( \frac{q(\mathbf{x})}{p(\mathbf{x}; \mathbf{w})}\right)\mathrm{d}\mathbf{x} = G_\mathbf{w}(q) - F_\mathbf{w}$$

The Kullback-Leibler divergence $D_\mathrm{KL}(q \| p)$ is a nonnegative measure of the divergence between two distributions $q$ and $p$ that is zero if and only if $q=p$. In the special case when $p$ takes the Boltzmann form, the KL divergence becomes the difference between the Gibbs free energy of $q$, defined as $G_\mathbf{w}(q) = \langle E(\mathbf{x}; \mathbf{w})\rangle_q - S(q)$ (where $S(q) = -\int q(\mathbf{x})\log q(\mathbf{x}) \mathrm{d}\mathbf{x}$ is the entropy of $q$), and the Helmholtz free energy $F_\mathbf{w}$ of $p$.
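The identity $D_\mathrm{KL}(q \| p) = G_\mathbf{w}(q) - F_\mathbf{w}$ can be checked numerically on a toy discrete state space. The sketch below (state count and energies are arbitrary illustrative choices) builds a Boltzmann distribution from a random energy function and verifies that the directly computed KL divergence equals the Gibbs free energy of $q$ minus the Helmholtz free energy of $p$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete state space with a random energy function E(x; w)
n_states = 8
E = rng.normal(size=n_states)       # energies of the model p(x) ∝ exp(-E(x))
logZ = np.log(np.exp(-E).sum())     # log partition function
p = np.exp(-E - logZ)               # Boltzmann distribution
F = -logZ                           # Helmholtz free energy F_w = -log Z_w

# An arbitrary "data" distribution q over the same states
q = rng.random(n_states)
q /= q.sum()

# Gibbs free energy of q: G(q) = <E>_q - S(q)
S = -(q * np.log(q)).sum()          # entropy of q
G = (q * E).sum() - S

# KL divergence computed directly from its definition
kl = (q * np.log(q / p)).sum()

assert np.isclose(kl, G - F)        # D_KL(q||p) = G(q) - F
```

Since $S(q)$ and $F$ use the same logarithm base, the identity holds exactly up to floating-point error, and the nonnegativity of the KL divergence implies $G_\mathbf{w}(q) \geq F_\mathbf{w}$ for any $q$.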

Learning then corresponds to fixing the data distribution $q$ and optimizing the model parameters $\mathbf{w}$ in $D_\mathrm{KL}(q \| p) = G_\mathbf{w}(q) - F_\mathbf{w}$, as in the expression above. However, this decomposition has another widespread application in both machine learning and statistical mechanics. Often we are given a fixed Boltzmann distribution with coupling parameters $\mathbf{w}$ that we would like to approximate with a simpler variational distribution $q(\mathbf{x})$. Such an approximation can be derived by fixing $\mathbf{w}$ and minimizing with respect to $q$ either the KL divergence $D_\mathrm{KL}(q \| p)$ or, equivalently, the Gibbs free energy $G_\mathbf{w}(q)$. This approach leads to both variational inference in machine learning and variational mean-field methods in equilibrium statistical mechanics.
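As a concrete instance of this second use, one can minimize the Gibbs free energy over factorized (product) distributions for a small Ising model. For spins $x_i \in \{-1, +1\}$ with energy $E(\mathbf{x}) = -\tfrac{1}{2}\mathbf{x}^\top \mathbf{J}\mathbf{x} - \mathbf{b}\cdot\mathbf{x}$, stationarity of $G$ over product distributions yields the classic mean-field self-consistency equations $m_i = \tanh(b_i + \sum_j J_{ij} m_j)$. The sketch below (couplings, fields, and system size are illustrative, chosen weak enough that mean field is accurate) compares the mean-field magnetizations to exact enumeration:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
N = 4                                    # small enough to enumerate exactly
J = rng.normal(scale=0.2 / np.sqrt(N), size=(N, N))
J = (J + J.T) / 2                        # symmetric couplings
np.fill_diagonal(J, 0.0)
b = rng.normal(scale=0.1, size=N)        # local fields

def energy(x):
    return -0.5 * x @ J @ x - b @ x

# Exact magnetizations by brute-force enumeration of the Boltzmann distribution
states = np.array(list(product([-1.0, 1.0], repeat=N)))
w = np.exp(-np.array([energy(s) for s in states]))
w /= w.sum()
m_exact = w @ states

# Variational mean field: minimizing G over factorized q gives the
# self-consistency equations m_i = tanh(b_i + sum_j J_ij m_j)
m = np.zeros(N)
for _ in range(200):                     # fixed-point iteration
    m = np.tanh(b + J @ m)

# For weak couplings the factorized approximation is close to exact
assert np.allclose(m, m_exact, atol=0.05)
```

For strong couplings the factorized ansatz breaks down, which is precisely where more refined variational families or corrections (e.g. higher-order expansions) become necessary.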

Consider a deep feedforward network with $D$ layers of weights $\mathbf{W}_1, \cdots, \mathbf{W}_D$ and $D+1$ layers of neural activity vectors $\mathbf{x}_0, \cdots, \mathbf{x}_D$, with $N_l$ neurons in each layer $l$, so that $\mathbf{x}_l \in \mathbb{R}^{N_l}$ and $\mathbf{W}_l$ is an $N_l \times N_{l-1}$ weight matrix. The feedforward dynamics elicited by an input $\mathbf{x}_0$ are given by a single-neuron scalar nonlinearity $\phi$ applied elementwise:

$$\mathbf{x}_l = \phi(\mathbf{h}_l)$$

where $\mathbf{h}_l$ is the pattern of inputs to neurons at layer $l$ and $\mathbf{b}_l$ is a vector of biases:

$$\mathbf{h}_l = \mathbf{W}_l \mathbf{x}_{l-1} + \mathbf{b}_l\; \text{for}\; l = 1, \cdots, D$$

In mean-field theory of signal propagation we wish to understand the nature of typical functions computable by such networks. Consider the general case in which the synaptic weights $(\mathbf{W}_l)_{ij}$ are drawn i.i.d. from a zero-mean Gaussian with variance $\sigma_w^2 / N_{l-1}$, and the biases $(\mathbf{b}_l)_i$ are drawn i.i.d. from a zero-mean Gaussian with variance $\sigma_b^2$. This weight scaling ensures that the input contribution to each individual neuron at layer $l$ from the activities of layer $l-1$ remains $O(1)$, independent of the layer width $N_{l-1}$. This ensemble constitutes a maximum entropy distribution over deep neural networks, subject to constraints on the means and variances of weights and biases, and induces no further structure in the resulting set of deep functions.
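A minimal sketch of a forward pass through this random ensemble, with hypothetical layer widths and parameter scales, illustrates why the $1/N_{l-1}$ weight variance keeps preactivations $O(1)$:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_w, sigma_b = 1.5, 0.3            # weight/bias scales (illustrative values)
widths = [784, 512, 512, 512, 512]     # N_0, ..., N_D (hypothetical layer sizes)
phi = np.tanh                          # single-neuron nonlinearity

x = rng.normal(size=widths[0])         # a single input vector x_0
for N_in, N_out in zip(widths[:-1], widths[1:]):
    # Variance sigma_w^2 / N_in keeps each preactivation O(1) at any width
    W = rng.normal(scale=sigma_w / np.sqrt(N_in), size=(N_out, N_in))
    b = rng.normal(scale=sigma_b, size=N_out)
    h = W @ x + b                      # preactivations h_l
    x = phi(h)                         # activations x_l

q = (h ** 2).mean()                    # normalized squared length at the top layer
assert 0.1 < q < 10.0                  # remains O(1), independent of layer width
```

Doubling every width in `widths` leaves the final `q` essentially unchanged, which is exactly the property the scaling is designed to guarantee.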

In the limit of large layer widths, $N_l \to \infty$, certain aspects of signal propagation through deep random networks take on a deterministic character. This deterministic limit enables us to understand how the Riemannian geometry of a manifold in the input layer $\mathbf{x}_0$ is typically modified as the manifold propagates into deep layers. For example, consider the simplest case of a single input vector $\mathbf{x}_0$. As it propagates through the network, its length in downstream layers will change. We track this change by computing the normalized squared length of the input vector at each layer:

$$q_l = \frac{1}{N_l} \sum^{N_l}_{i=1} \left( (\mathbf{h}_l)_i \right )^2$$

This length is the second moment of the empirical distribution of the components of $\mathbf{h}_l$ across all $N_l$ neurons in layer $l$. In the large-width limit $N_l \to \infty$, this distribution converges to a zero-mean Gaussian, since each component of $\mathbf{h}_l$ is a weighted sum of a large number of uncorrelated random variables (the weights in $\mathbf{W}_l$ and biases $\mathbf{b}_l$), which are independent of the activity in the previous layer. By propagating this Gaussian distribution across one layer, we obtain an iterative map for $q_l$:

$$q_l = \mathcal{V}(q_{l-1}|\sigma_w,\sigma_b) \equiv \sigma^2_w \int \mathcal{D} z\, \phi \left( \sqrt{q_{l-1}}\,z\right)^2 + \sigma_b^2\; \text{for}\; l = 2, \cdots, D$$

where $\mathcal{D} z$ is the standard Gaussian measure:

$$\mathcal{D} z = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} dz$$

and the initial condition is $q_1 = \sigma_w^2 q_0 + \sigma_b^2$, where $q_0 = \frac{1}{N_0} \mathbf{x}_0 \cdot \mathbf{x}_0$ is the length in the initial activity layer.
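The iterative map $\mathcal{V}$ can be evaluated directly by one-dimensional Gaussian quadrature. The sketch below (the $(\sigma_w, \sigma_b)$ values, initial length, and depth are illustrative choices) iterates the length map from the initial condition and confirms that it settles to a fixed point $q^* = \mathcal{V}(q^*|\sigma_w,\sigma_b)$:

```python
import numpy as np

sigma_w, sigma_b = 1.3, 0.2            # illustrative (sigma_w, sigma_b) choice
phi = np.tanh

# V(q) = sigma_w^2 ∫ Dz phi(sqrt(q) z)^2 + sigma_b^2, with Dz the standard
# Gaussian measure, evaluated by probabilists' Gauss-Hermite quadrature
z, w = np.polynomial.hermite_e.hermegauss(50)
w = w / w.sum()                        # normalize to the standard normal measure

def length_map(q):
    return sigma_w**2 * (w * phi(np.sqrt(q) * z)**2).sum() + sigma_b**2

q0 = 2.0                               # input length q_0 (arbitrary)
q = sigma_w**2 * q0 + sigma_b**2       # initial condition q_1
for _ in range(100):                   # layers l = 2, ..., D
    q = length_map(q)

# For tanh the iteration converges rapidly to a stable fixed point q* = V(q*)
assert np.isclose(q, length_map(q), atol=1e-6)
```

In practice the map converges within a few layers, so the length of a generic input quickly forgets $q_0$ and is determined by $(\sigma_w, \sigma_b)$ alone.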

## Length Map

As a single input point $\mathbf{x}_0$ propagates through the network, its length in downstream layers can either grow or shrink. To track this propagation, we compute the normalized squared length of the input vector at each layer.