In information theory, Rényi entropy refers to a family of entropy measures that are essentially logarithms of diversity indices. For special values of its order parameter $\alpha$, the Rényi entropy reproduces the Shannon entropy, the Hartley entropy (max-entropy), the collision entropy, and the min-entropy.
Definition
For a discrete probability distribution $p = (p_1, \ldots, p_n)$ and an order $\alpha \geq 0$ with $\alpha \neq 1$, the Rényi entropy of $p$ at order $\alpha$ is:
$$H_\alpha(p) := \frac{1}{1-\alpha}\log\left( \sum_{i=1}^n (p_i)^\alpha \right)$$
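As a quick illustration (not part of the definition), a NumPy sketch of this formula might look as follows; the function name `renyi_entropy`, the use of natural logarithms, and the explicit handling of the limiting orders $\alpha = 1$ and $\alpha \to \infty$ are choices made here for convenience:

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha(p) of a discrete distribution p, in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # outcomes with zero probability do not contribute
    if np.isclose(alpha, 1.0):         # alpha = 1: Shannon limit
        return -np.sum(p * np.log(p))
    if np.isinf(alpha):                # alpha -> infinity: min-entropy
        return -np.log(np.max(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
for a in (0.0, 0.5, 1.0, 2.0, np.inf):
    print(a, renyi_entropy(p, a))
```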
Properties
The Rényi entropy is an anti-monotone function of the order parameter $\alpha$: it is non-increasing as $\alpha$ grows. That is:
$$\alpha_1 \leq \alpha_2 \implies H_{\alpha_1}(p) \geq H_{\alpha_2}(p)$$
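A minimal numerical check of this anti-monotonicity, on one arbitrary example distribution and a grid of orders (both chosen here purely for illustration, using natural logarithms):

```python
import numpy as np

def H(p, a):                                   # Rényi entropy (Shannon branch at a = 1)
    if np.isclose(a, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** a)) / (1 - a)

p = np.array([0.6, 0.3, 0.08, 0.02])
alphas = np.linspace(0.0, 5.0, 51)
values = [H(p, a) for a in alphas]
# non-increasing in alpha, up to floating-point noise
assert all(h1 >= h2 - 1e-12 for h1, h2 in zip(values, values[1:]))
```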
For various (limiting) values of $\alpha$ the Rényi entropy reduces to notions of entropy that are known by their own names:
Order | $0$ | $\lim_{\alpha \to 1}$ | $2$ | $\cdots$ | $\lim_{\alpha \to \infty}$ |
---|---|---|---|---|---|
Rényi entropy | max-entropy | Shannon entropy | collision entropy | … | min-entropy |
In particular, in terms of the above special cases, this means that:
Hartley entropy $\geq$ Shannon entropy $\geq$ collision entropy $\geq \cdots \geq$ min-entropy
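For concreteness, here is a small check of this chain of inequalities on one example distribution; the distribution and the natural-log base are arbitrary choices for illustration:

```python
import numpy as np

p = np.array([0.4, 0.3, 0.2, 0.1])
H0   = np.log(np.count_nonzero(p))        # Hartley / max-entropy: log of the support size
H1   = -np.sum(p * np.log(p))             # Shannon entropy
H2   = -np.log(np.sum(p ** 2))            # collision entropy
Hinf = -np.log(np.max(p))                 # min-entropy
assert H0 >= H1 >= H2 >= Hinf
print(H0, H1, H2, Hinf)
```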
Interpretation
The order parameter $\alpha$ controls how the entropy weights likely versus unlikely outcomes. In the limit $\alpha \to 1$ the Rényi entropy reduces to the Shannon entropy, the expected "surprise" or uncertainty of a random variable. For small $\alpha$, all outcomes of nonzero probability contribute roughly equally (at $\alpha = 0$ only the size of the support matters), while for large $\alpha$ the value is increasingly dominated by the most probable outcomes (as $\alpha \to \infty$, only the single most probable outcome matters). Since $H_\alpha$ depends on $p$ only through the power sum $\sum_i p_i^\alpha$, its level sets are hypersurfaces in the probability simplex.
The closely related Rényi divergence, discussed below, plays the analogous role of a (non-symmetric) measure of dissimilarity between probability distributions, and is a standard tool in the information-geometric study of spaces of distributions.
Rényi Divergence
Main article: Rényi divergence
Related to the Rényi entropy is the Rényi divergence, a measure of how much one probability distribution diverges from another. For two discrete probability distributions $p$ and $q$, the Rényi divergence of order $\alpha$ (again for $\alpha \neq 1$) is defined as:
$$D_\alpha(p || q) = \frac{1}{\alpha - 1} \log \left( \sum_i \frac{p_i^\alpha}{q_i^{\alpha-1}} \right)$$
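A sketch of this formula in NumPy, together with a check that the order-$\alpha$ value approaches the Kullback-Leibler divergence as $\alpha \to 1$; the function name and the natural-log base are choices made here, not fixed by the definition:

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(p || q) for strictly positive q, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):                       # alpha = 1: Kullback-Leibler divergence
        return np.sum(p * np.log(p / q))
    return np.log(np.sum(p ** alpha / q ** (alpha - 1))) / (alpha - 1)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
print(renyi_divergence(p, q, 0.999), renyi_divergence(p, q, 1.0))   # nearly equal
```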
Rényi divergence generalizes Kullback-Leibler divergence in much the same way as Rényi entropy generalizes Shannon entropy:
Order | $0$ | $\lim_{\alpha \to 1}$ | $2$ | $\cdots$ | $\lim_{\alpha \to \infty}$ |
---|---|---|---|---|---|
Rényi divergence | $-\log Q(\{i : p_i > 0\})$ | Kullback-Leibler divergence | $\log \mathbb{E}_{i \sim p}\!\left[ \frac{p_i}{q_i} \right]$ | … | $\log \sup_i \frac{p_i}{q_i}$ |
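The two endpoint columns of this table can be checked directly on a small example; the distributions below are arbitrary, with $p$ given a zero entry so that the order-$0$ value is nontrivial:

```python
import numpy as np

p = np.array([0.7, 0.3, 0.0])
q = np.array([0.5, 0.3, 0.2])
mask = p > 0

d0   = -np.log(np.sum(q[mask]))                    # -log Q({i : p_i > 0})
dinf = np.log(np.max(p[mask] / q[mask]))           # log sup_i p_i / q_i

def d(alpha):                                      # finite-order formula from above
    return np.log(np.sum(p[mask] ** alpha / q[mask] ** (alpha - 1))) / (alpha - 1)

print(d(1e-4), d0)       # order near 0: approximately equal
print(d(200.0), dinf)    # large order: approximately equal
```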
Rényi divergence is not a metric in the strict sense (it is not symmetric and does not satisfy the triangle inequality), but it serves as a measure of dissimilarity on the space of probability distributions. Its second-order expansion around $p = q$ recovers the Fisher information metric (up to a factor depending on $\alpha$), which is what equips that space with its Riemannian structure in information geometry. The family also subsumes other standard divergences: $\alpha \to 1$ gives the Kullback-Leibler divergence, while $\alpha = 1/2$ is a monotone function of the Hellinger distance. In this sense Rényi divergence provides a unifying framework for various distance measures in information geometry.
Properties
The value $\alpha = 1$, which gives the Shannon entropy and the Kullback–Leibler divergence, is the only value at which the chain rule of conditional probability holds exactly, both for absolute entropies:
$$H(A,X) = H(A) + \mathbb{E}_{a \sim A} \big[ H(X| A=a) \big]$$
and relative entropies:
$$D_\mathrm{KL}(p(x|a)p(a)\|m(x,a)) = D_\mathrm{KL}(p(a)\|m(a)) + \mathbb{E}_{p(a)}\{D_\mathrm{KL}(p(x|a)\|m(x|a))\}$$
The latter in particular means that if we seek a distribution $p(x, a)$ that minimizes the divergence from some underlying prior measure $m(x, a)$, and we acquire new information that affects only the distribution of $a$, then the conditional distribution $p(x|a)$ remains $m(x|a)$, unchanged.
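Both chain rules are easy to verify numerically; the following sketch uses small arbitrary joint tables $p(a, x)$ and $m(a, x)$ chosen only for illustration:

```python
import numpy as np

def H(t):                                     # Shannon entropy of a probability table, in nats
    return -np.sum(t * np.log(t))

def KL(s, t):                                 # Kullback-Leibler divergence between tables
    return np.sum(s * np.log(s / t))

p = np.array([[0.20, 0.10],                   # joint p(a, x): rows index a, columns index x
              [0.15, 0.25],
              [0.05, 0.25]])
m = np.array([[0.15, 0.25],                   # reference measure m(a, x)
              [0.20, 0.10],
              [0.10, 0.20]])

pa, ma = p.sum(axis=1), m.sum(axis=1)         # marginals of a
px_a, mx_a = p / pa[:, None], m / ma[:, None] # conditionals p(x|a), m(x|a)

# absolute chain rule: H(A, X) = H(A) + E_{a~A}[ H(X | A = a) ]
assert np.isclose(H(p), H(pa) + np.sum(pa * np.array([H(r) for r in px_a])))

# relative chain rule:
# D_KL(p(x|a)p(a) || m(x, a)) = D_KL(p(a) || m(a)) + E_{p(a)}[ D_KL(p(x|a) || m(x|a)) ]
assert np.isclose(KL(p, m),
                  KL(pa, ma) + np.sum(pa * np.array([KL(px_a[i], mx_a[i])
                                                     for i in range(len(pa))])))
```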
The Rényi entropies and divergences at other values of $\alpha$ still satisfy the criteria of being positive and continuous, of being invariant under 1-to-1 coordinate transformations, and of combining additively when $A$ and $X$ are independent, so that if $p(A, X) = p(A)p(X)$, then
$$H_\alpha(A,X) = H_\alpha(A) + H_\alpha(X) $$
$$D_\alpha(P(A)P(X)\|Q(A)Q(X)) = D_\alpha(P(A)\|Q(A)) + D_\alpha(P(X)\|Q(X))$$
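A quick numerical check of this additivity for one non-unit order (here $\alpha = 2$); the marginal distributions below are arbitrary illustrations, and the independent joints are built by an outer product:

```python
import numpy as np

def H(p, a):                                  # Rényi entropy of order a (a != 1)
    return np.log(np.sum(p ** a)) / (1 - a)

def D(p, q, a):                               # Rényi divergence of order a (a != 1)
    return np.log(np.sum(p ** a / q ** (a - 1))) / (a - 1)

alpha = 2.0
pA, pX = np.array([0.6, 0.4]), np.array([0.5, 0.3, 0.2])
qA, qX = np.array([0.3, 0.7]), np.array([0.2, 0.4, 0.4])

pAX = np.outer(pA, pX).ravel()                # joint of independent A and X under P
qAX = np.outer(qA, qX).ravel()                # joint of independent A and X under Q

assert np.isclose(H(pAX, alpha), H(pA, alpha) + H(pX, alpha))
assert np.isclose(D(pAX, qAX, alpha), D(pA, qA, alpha) + D(pX, qX, alpha))
```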
The stronger properties of the $\alpha = 1$ quantities allow the definition of conditional information and mutual information from communication theory.