Mercer's Theorem

Mercer's theorem is a fundamental result in mathematics, particularly in functional analysis and operator theory. It plays a crucial role in the theory of integral equations and has significant applications in machine learning, especially in the construction of kernel methods.

# Overview

Mercer's theorem is a result concerning the representation of a continuous, symmetric, and positive definite kernel function. It states that such a kernel $k(x, y)$ can be expressed as a sum:
$$k(x, y) = \sum_{n=1}^{\infty} \lambda_n \phi_n(x) \phi_n(y)$$
where:

• $\lambda_n$ are the non-negative eigenvalues,
• $\phi_n(x)$ are the corresponding eigenfunctions.

The theorem ensures that the kernel can be represented in terms of orthogonal functions, providing a solid foundation for various mathematical and computational applications.

Mercer's theorem provides a bridge between linear algebra and functional analysis, allowing for the decomposition of a kernel function into a series of eigenfunctions. This decomposition is essential in various fields, including statistics, machine learning, and signal processing. By understanding the spectral properties of a kernel, Mercer's theorem offers insights into the underlying structure of the data, facilitating more sophisticated modeling and analysis.

# Definition

To explain Mercer's theorem, we first consider an important special case; see below for a more general formulation. A kernel, in this context, is a continuous function $k : [a,b] \times [a,b] \to \mathbb{R}$ that is symmetric, that is, $k(x,y) = k(y,x)$ for all $x, y \in [a,b]$.

The kernel $k$ is said to be positive-definite if and only if
$\sum _{i=1}^{n}\sum _{j=1}^{n}k(x_{i},x_{j})c_{i}c_{j}\geq 0$
for all finite sequences of points $x_1, \cdots, x_n$ of $[a, b]$ and all choices of real numbers $c_1, \cdots, c_n$. Note that the term "positive-definite" is well-established in literature despite the weak inequality in the definition.
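This finite-sum condition can be checked empirically on any finite point set by forming the Gram matrix and inspecting its eigenvalues. The following sketch does this for the Gaussian (RBF) kernel; the bandwidth and grid are illustrative choices, not part of the theorem.

```python
import numpy as np

# Illustrative sketch: verify symmetry and positive-definiteness of the
# Gaussian (RBF) kernel on a finite set of points in [0, 1].
def rbf_kernel(x, y, gamma=5.0):
    return np.exp(-gamma * (x - y) ** 2)

points = np.linspace(0.0, 1.0, 30)            # x_1, ..., x_n in [a, b]
gram = rbf_kernel(points[:, None], points[None, :])

# All eigenvalues of the Gram matrix are non-negative (up to round-off),
# which is equivalent to sum_ij k(x_i, x_j) c_i c_j >= 0 for all real c.
min_eig = np.linalg.eigvalsh(gram).min()

# Spot-check the quadratic form directly with a random coefficient vector.
rng = np.random.default_rng(1)
c = rng.normal(size=points.size)
quad_form = c @ gram @ c
```

Because the weak inequality is all the definition demands, a vanishing eigenvalue (as occurs for degenerate kernels) is still consistent with positive-definiteness in this sense.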

Associated with $k$ is a linear operator (more specifically, a Hilbert–Schmidt integral operator) on functions, defined by the integral

$$[T_{k}\varphi ](x)=\int _{a}^{b}k(x,s)\varphi (s)\,ds.$$
For technical considerations we assume $\varphi$ can range through the space $L^2[a, b]$ (see Lp space) of square-integrable real-valued functions. Since $T_k$ is a linear operator, we can talk about eigenvalues and eigenfunctions of $T_k$.
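Discretizing the integral turns $T_k$ into a matrix acting on grid samples of $\varphi$, which makes the eigenvalue/eigenfunction language concrete. The sketch below uses the Brownian-motion kernel $k(x,s) = \min(x,s)$ on $[0,1]$, for which $\varphi(x) = \sin(\pi x/2)$ is a known eigenfunction with eigenvalue $4/\pi^2$; the grid size and quadrature rule are illustrative choices.

```python
import numpy as np

# Illustrative sketch: approximate [T_k phi](x) = ∫_0^1 k(x, s) phi(s) ds
# by trapezoidal quadrature on a uniform grid.
a, b, n = 0.0, 1.0, 200
grid = np.linspace(a, b, n)
h = (b - a) / (n - 1)

K = np.minimum.outer(grid, grid)      # k(x, s) = min(x, s) on the grid
w = np.full(n, h)                     # trapezoidal quadrature weights
w[0] = w[-1] = h / 2.0

# phi(x) = sin(pi x / 2) is an eigenfunction of T_k with eigenvalue 4/pi^2.
phi = np.sin(np.pi * grid / 2.0)
T_phi = K @ (w * phi)                 # quadrature approximation of T_k phi
lam = 4.0 / np.pi ** 2
```

Up to quadrature error, `T_phi` is a scalar multiple of `phi`, which is exactly the eigenvalue relation $T_k \varphi = \lambda \varphi$ the theorem is built on.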

Theorem (Mercer). Let $k : [a,b]\times[a,b] \to \mathbb{R}$ be a continuous symmetric positive-definite kernel. Then there is an orthonormal basis $\{e_i\}_i$ of $L^2[a, b]$ consisting of eigenfunctions of the Hilbert–Schmidt integral operator $T_k$ such that the corresponding sequence of eigenvalues $\{\lambda_i\}_i$ is nonnegative. The eigenfunctions corresponding to non-vanishing eigenvalues are continuous on $[a, b]$, and $k$ has the representation $$k(s,t)=\sum _{j=1}^{\infty }\lambda _{j}\,e_{j}(s)\,e_{j}(t)$$ where the convergence is absolute and uniform.
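The expansion can be illustrated numerically: discretizing a kernel on a grid, diagonalizing the resulting matrix, and summing the leading terms of the series reproduces the kernel to high accuracy. The kernel, grid size, and truncation level below are illustrative choices.

```python
import numpy as np

# Illustrative sketch: a truncated Mercer series for a Gaussian kernel on
# [0, 1]. Eigenpairs of the scaled Gram matrix K*h approximate the operator
# eigenvalues lambda_j and (after rescaling) orthonormal eigenfunctions e_j.
n = 100
grid = np.linspace(0.0, 1.0, n)
h = 1.0 / n
K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / 0.1)

evals, vecs = np.linalg.eigh(K * h)           # ascending eigenvalues
evals, vecs = evals[::-1], vecs[:, ::-1]      # sort descending
funcs = vecs / np.sqrt(h)                     # approximate e_j on the grid

# Truncated series sum_{j<=m} lambda_j e_j(s) e_j(t)
m = 40
K_approx = (funcs[:, :m] * evals[:m]) @ funcs[:, :m].T
max_err = np.abs(K - K_approx).max()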

# Spectral Theory

Main article: Spectral theory

The proof of Mercer's theorem relies heavily on the spectral theory of compact operators. It begins by considering the integral operator associated with the kernel and then studies its spectral properties. The compactness of the operator ensures the existence of a countable set of eigenvalues and corresponding eigenfunctions.

The structure of the proof highlights the deep connection between Mercer's theorem and the spectral theory of compact operators. By understanding the eigenvalues and eigenfunctions of the operator, one can derive the representation of the kernel function in terms of these spectral components. This connection has profound implications in various fields, including functional analysis, differential equations, and mathematical physics.

# Mercer's Condition

Mercer's condition is the requirement that ensures the applicability of Mercer's theorem: the kernel function must be continuous, symmetric, and positive definite. Equivalently, the kernel must satisfy $$\iint k(x,y)\,g(x)\,g(y)\,dx\,dy \geq 0$$ for every square-integrable function $g$. These properties guarantee that the kernel can be decomposed into a series of orthogonal eigenfunctions, allowing the application of Mercer's theorem.

## Kernel Trick

Main article: Kernel trick

The kernel trick is a technique in machine learning that leverages Mercer's theorem to implicitly map data into a higher-dimensional space. By using a kernel function that satisfies Mercer's condition, one can apply linear methods in the transformed space without explicitly dealing with the high-dimensional representation. This trick has led to groundbreaking developments in various algorithms and models, including support vector machines and kernel PCA.
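A minimal instance of the trick: for the polynomial kernel $k(x,y) = (x \cdot y + 1)^2$ on $\mathbb{R}^2$, an explicit feature map into $\mathbb{R}^6$ exists, and evaluating the kernel gives the same inner product without ever forming the features. The function names and dimensions below are illustrative.

```python
import numpy as np

# Illustrative sketch of the kernel trick for the degree-2 polynomial kernel.
def poly_kernel(x, y):
    # Implicit inner product in the feature space
    return (x @ y + 1.0) ** 2

def feature_map(x):
    # Explicit map phi: R^2 -> R^6 with <phi(x), phi(y)> = (x . y + 1)^2
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

rng = np.random.default_rng(0)
x, y = rng.normal(size=2), rng.normal(size=2)

implicit = poly_kernel(x, y)                   # never touches R^6
explicit = feature_map(x) @ feature_map(y)     # works in R^6 directly
```

The two quantities agree exactly, which is why algorithms phrased entirely in terms of inner products (such as support vector machines) can operate in very high-dimensional, or even infinite-dimensional, feature spaces at the cost of a single kernel evaluation.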