Log-Likelihood

The log-likelihood is a central concept in statistical modeling and machine learning, with roots in probability theory and information theory. It is the logarithm of the likelihood function and plays a key role in parameter estimation, model selection, and hypothesis testing. It is also related to the Kullback-Leibler divergence and, in the context of energy-based models in machine learning, to the Boltzmann distribution.

# Overview

For independent and identically distributed observations $x_1, \dots, x_n$, the log-likelihood function is the natural logarithm of the likelihood function:

$$l(\theta) = \log L(\theta) = \sum_{i=1}^n \log f(x_i; \theta)$$

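Here, $L(\theta) = \prod_{i=1}^n f(x_i; \theta)$ is the likelihood function for independent observations, $f(x_i; \theta)$ is the probability density function (pdf), or the probability mass function for discrete data, and $\theta$ is the vector of parameters. The log-likelihood is easier to work with, both analytically and numerically, because it turns products into sums: sums are simpler to differentiate, and they do not underflow the way a product of many small densities does. Because the logarithm is strictly increasing, $l(\theta)$ and $L(\theta)$ are maximized at the same value of $\theta$.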
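As a minimal sketch of the computational point above, the following Python snippet evaluates both the likelihood and the log-likelihood of a Gaussian model. The data set, the helper names (`gaussian_log_pdf`, `log_likelihood`, `likelihood`), and the parameter values are assumptions chosen for illustration only.

```python
import math

def gaussian_log_pdf(x, mu, sigma):
    """Log of the normal density with mean mu and standard deviation sigma."""
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu) ** 2 / (2.0 * sigma**2)

def log_likelihood(data, mu, sigma):
    """Sum of per-observation log densities (assumes i.i.d. observations)."""
    return sum(gaussian_log_pdf(x, mu, sigma) for x in data)

def likelihood(data, mu, sigma):
    """Direct product of densities; underflows to 0.0 for large samples."""
    return math.prod(math.exp(gaussian_log_pdf(x, mu, sigma)) for x in data)

# Hypothetical data: 2000 points on an evenly spaced grid over [-2, 2].
data = [-2.0 + 4.0 * i / 1999 for i in range(2000)]

ll = log_likelihood(data, mu=0.0, sigma=1.0)
lik = likelihood(data, mu=0.0, sigma=1.0)

print(ll)   # a finite negative number
print(lik)  # 0.0: the product of 2000 densities underflows in double precision
```

Each density is below 0.4, so the product of 2000 of them is far smaller than the smallest representable double, while the sum of logarithms stays well within range.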

# Properties

The log-likelihood has several important properties:

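• Concavity: The log-likelihood is not concave in general, but it is concave for many common models, such as exponential families in their natural parameterization. In that case any local maximum is a global maximum, and the maximum is unique when the concavity is strict.
• Invariance: The maximum likelihood estimator (MLE) is invariant under reparameterization: if $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$. For a one-to-one transformation $g$, the maximized value of the log-likelihood is unchanged.
• Asymptotic Normality: Under standard regularity conditions, the MLE is asymptotically normal, centered at the true parameter with covariance given by the inverse of the Fisher information matrix. The Fisher information is the negative expected Hessian of the log-likelihood.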
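The properties above can be illustrated with a small worked case. The sketch below assumes an exponential model with rate $\lambda$, for which $l(\lambda) = n \log \lambda - \lambda \sum_i x_i$; the data values and variable names are hypothetical.

```python
import math

def exp_log_likelihood(rate, data):
    """Log-likelihood of i.i.d. exponential data: n*log(rate) - rate*sum(x)."""
    return len(data) * math.log(rate) - rate * sum(data)

# Hypothetical waiting times.
data = [0.5, 1.2, 0.3, 2.0, 1.0]
n = len(data)

# Closed-form MLE: setting dl/d(rate) = n/rate - sum(x) to zero.
rate_hat = n / sum(data)

# Invariance: the MLE of the mean 1/rate is 1/rate_hat, i.e. the sample mean.
mean_hat = 1.0 / rate_hat

# Concavity: the second derivative is -n/rate^2 < 0 for every rate > 0,
# so rate_hat is the unique maximizer.
second_derivative = -n / rate_hat**2

# Asymptotic normality: the Fisher information for n observations is n/rate^2,
# giving an approximate standard error of rate_hat / sqrt(n).
fisher_info = n / rate_hat**2
std_err = 1.0 / math.sqrt(fisher_info)

print(rate_hat, mean_hat, std_err)
```

With these five observations the estimated rate is close to 1 and its approximate standard error is about 0.447. With so few data points the normal approximation is rough, so the standard error is only indicative.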

## Applications

Applications of the log-likelihood include:

### Statistical Mechanics

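In statistical mechanics, the log-likelihood appears in mean-field theory, where it connects to the free energy and the entropy: for a Boltzmann distribution, the negative log-probability of a state equals its energy minus the free energy, both measured in units of $k_B T$.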

### Energy-Based Models

Main article: Energy-based model

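In energy-based models (EBMs), the model distribution takes the Boltzmann form $p(\mathbf{x}; \mathbf{w}) = e^{-E(\mathbf{x}; \mathbf{w})} / Z_\mathbf{w}$, where $E$ is the energy function and $Z_\mathbf{w}$ is the normalizing constant (partition function). The negative log-likelihood is often used as the loss function to be minimized during training. Averaged over the data, the negative log-likelihood of an EBM is: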

$$-l(\mathbf{w}) = \langle E(\mathbf{x}; \mathbf{w})\rangle_q - F_\mathbf{w}$$

where $\langle \cdot \rangle_q$ denotes an average with respect to the data distribution $q(\mathbf{x})$, and $F_\mathbf{w} = -\log Z_\mathbf{w}$ is the Helmholtz free energy (at unit temperature) of the model distribution $p(\mathbf{x}; \mathbf{w})$. Learning corresponds to maximizing the log-likelihood or, equivalently, minimizing the negative log-likelihood.
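The identity above can be checked numerically on a toy model. The following sketch assumes a model with four discrete states, a Boltzmann-form distribution $p(\mathbf{x}; \mathbf{w}) \propto e^{-E(\mathbf{x}; \mathbf{w})}$, and made-up values for the energies and the data distribution; none of these numbers come from a real model.

```python
import math

# Hypothetical energies E(x; w) for a model with four discrete states.
energies = [0.0, 1.0, 2.0, 0.5]
# Hypothetical data distribution q(x) over the same four states.
q = [0.4, 0.2, 0.1, 0.3]

# Partition function, free energy, and model distribution p(x; w).
Z = sum(math.exp(-e) for e in energies)
free_energy = -math.log(Z)
p = [math.exp(-e) / Z for e in energies]

# Negative log-likelihood computed directly as -<log p>_q.
nll_direct = -sum(qx * math.log(px) for qx, px in zip(q, p))

# The same quantity computed as <E>_q - F_w.
mean_energy = sum(qx * e for qx, e in zip(q, energies))
nll_energy = mean_energy - free_energy

# Link to the Kullback-Leibler divergence: NLL = H(q) + KL(q || p).
entropy_q = -sum(qx * math.log(qx) for qx in q)
kl_qp = sum(qx * math.log(qx / px) for qx, px in zip(q, p))

print(nll_direct, nll_energy, entropy_q + kl_qp)
```

The three printed values agree up to floating-point rounding. Since the entropy of $q$ does not depend on $\mathbf{w}$, minimizing the negative log-likelihood is the same as minimizing the Kullback-Leibler divergence from the data distribution to the model.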