Maximum entropy probability distribution

In statistics and information theory, a maximum entropy probability distribution has entropy that is at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a certain class (usually defined in terms of specified properties or measures), then the distribution with the largest entropy should be chosen as the least-informative default. The motivation is twofold: first, maximizing entropy minimizes the amount of prior information built into the distribution; second, many physical systems tend to move towards maximal entropy configurations over time.

Definition of entropy and differential entropy

Further information: Entropy (information theory)

If X is a discrete random variable with distribution given by

\operatorname {Pr} (X=x_{k})=p_{k}\quad {\mbox{ for }}k=1,2,\ldots

then the entropy of X is defined as

H(X)=-\sum _{k\geq 1}p_{k}\log p_{k}\;.

If X is a continuous random variable with probability density p(x), then the differential entropy of X is defined as^[1]^[2]^[3]

H(X)=-\int _{-\infty }^{\infty }p(x)\log p(x)dx\;.

p(x) log p(x) is understood to be zero whenever p(x) = 0.

This is a special case of more general forms described in the articles Entropy (information theory), Principle of maximum entropy, and differential entropy. In connection with maximum entropy distributions, this is the only one needed, because maximizing $H(X)$ will also maximize the more general forms.

The base of the logarithm is not important as long as the same one is used consistently: change of base merely results in a rescaling of the entropy. Information theorists may prefer to use base 2 in order to express the entropy in bits; mathematicians and physicists will often prefer the natural logarithm, resulting in a unit of nats for the entropy.
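As a minimal sketch of the discrete definition and of the rescaling under a change of base (the function name `entropy` and the example probabilities are illustrative assumptions, not part of the article):

```python
import math

def entropy(probs, base=math.e):
    """Shannon entropy of a discrete distribution.

    Terms with p = 0 are skipped, matching the convention 0 * log 0 = 0.
    """
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p, base) for p in probs if p > 0)

probs = [0.5, 0.25, 0.25, 0.0]   # hypothetical example distribution
h_bits = entropy(probs, base=2)  # entropy in bits
h_nats = entropy(probs)          # entropy in nats

# Changing the base only rescales the entropy: h_nats = h_bits * ln 2.
print(h_bits, h_nats, h_bits * math.log(2))
```

The two values differ only by the constant factor ln 2, so a distribution that maximizes the entropy in one base maximizes it in every base.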

Distributions with measured constants

Many statistical distributions of practical interest are those for which the moments or other measurable quantities are constrained to be constants. The following theorem by Ludwig Boltzmann gives the form of the probability density under these constraints.

Continuous version

Suppose S is a closed subset of the real numbers R and we choose to specify n measurable functions f₁,...,f_n and n numbers a₁,...,a_n. We consider the class C of all real-valued random variables which are supported on S (i.e. whose density function is zero outside of S) and which satisfy the n expected value conditions

\operatorname {E} (f_{j}(X))=a_{j}\quad {\mbox{ for }}j=1,\ldots ,n

If there is a member in C whose density function is positive everywhere in S, and if there exists a maximal entropy distribution for C, then its probability density p(x) has the following shape:

p(x)=c\exp \left(\sum _{j=1}^{n}\lambda _{j}f_{j}(x)\right)\quad {\mbox{ for all }}x\in S

where the constants c and λ_j have to be determined so that the integral of p(x) over S is 1 and the above conditions for the expected values are satisfied. Conversely, if constants c and λ_j like this can be found, then p(x) is indeed the density of the (unique) maximum entropy distribution for our class C.
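As a worked instance of the theorem, take S = [0,∞), n = 1, f₁(x) = x and a₁ = m > 0. The symbol m for the prescribed mean is introduced here only for illustration. The theorem's shape p(x) = c exp(λ₁x) together with the two conditions then determines the constants:

```latex
\begin{aligned}
\int_0^\infty c\,e^{\lambda_1 x}\,dx = 1
  &\;\Longrightarrow\; \lambda_1 < 0,\quad c = -\lambda_1 \\
\int_0^\infty x\,c\,e^{\lambda_1 x}\,dx = -\frac{1}{\lambda_1} = m
  &\;\Longrightarrow\; \lambda_1 = -\frac{1}{m} \\
p(x) &= \frac{1}{m}\,e^{-x/m}\quad \text{for } x \ge 0
\end{aligned}
```

Constants satisfying both conditions exist, so by the converse part of the theorem this density (an exponential density with rate 1/m) is the maximum entropy distribution for the class.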

Discrete version

Suppose S = {x₁,x₂,...} is a (finite or infinite) discrete subset of the reals and we choose to specify n functions f₁,...,f_n and n numbers a₁,...,a_n. We consider the class C of all discrete random variables X which are supported on S and which satisfy the n conditions

\operatorname {E} (f_{j}(X))=a_{j}\quad {\mbox{ for }}j=1,\ldots ,n

If there exists a member of C which assigns positive probability to all members of S and if there exists a maximum entropy distribution for C, then this distribution has the following shape:

\operatorname {Pr} (X=x_{k})=c\exp \left(\sum _{j=1}^{n}\lambda _{j}f_{j}(x_{k})\right)\quad {\mbox{ for }}k=1,2,\ldots

where the constants c and λ_j have to be determined so that the sum of the probabilities is 1 and the above conditions for the expected values are satisfied. Conversely, if constants c and λ_j like this can be found, then the above distribution is indeed the maximum entropy distribution for our class C.

Proof

This theorem is proved with the calculus of variations and Lagrange multipliers. The constraints can be written as

$\int _{-\infty }^{\infty }f_{j}(x)p(x)dx=a_{j}$

We consider the functional

$J(p(x))=-\int _{-\infty }^{\infty }p(x)\ln {p(x)}dx+\lambda _{0}\left(\int _{-\infty }^{\infty }p(x)dx-1\right)+\sum _{j=1}^{n}\lambda _{j}\left(\int _{-\infty }^{\infty }f_{j}(x)p(x)dx-a_{j}\right)$

where the $\lambda _{j}$ are the Lagrange multipliers. The zeroth constraint enforces normalization (the second axiom of probability: the total probability is 1). The other constraints fix the expected values of the $n$ functions $f_{j}$ at the given constants $a_{j}$ . The entropy attains an extremum when the functional derivative is equal to zero:

${\frac {\delta {J(p(x))}}{\delta {p(x)}}}=-\ln {p(x)}-1+\lambda _{0}+\sum _{j=1}^{n}\lambda _{j}f_{j}(x)=0$

Because the entropy is a concave functional of $p$ and the constraints are linear in $p$ , this extremum is a maximum. Therefore, the maximum entropy probability distribution in this case must be of the form

$p(x)=e^{-1+\lambda _{0}}\cdot e^{\sum _{j=1}^{n}\lambda _{j}f_{j}(x)}=c\cdot \exp \left(\sum _{j=1}^{n}\lambda _{j}f_{j}(x)\right)\;.$

The proof of the discrete version is essentially the same.
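That a density of the exponential form attains the maximum, and not merely an extremum, can also be checked with a standard information-inequality argument. The sketch below assumes that p(x) = c exp(Σ λ_j f_j(x)) belongs to C; it writes q for an arbitrary density in C and D(q‖p) ≥ 0 for the Kullback–Leibler divergence (q and D are notation introduced here, not used elsewhere in the article):

```latex
\begin{aligned}
H(q) &= -\int_S q(x)\ln q(x)\,dx \\
     &= -\int_S q(x)\ln\frac{q(x)}{p(x)}\,dx - \int_S q(x)\ln p(x)\,dx \\
     &= -D(q\,\|\,p) - \int_S q(x)\Big(\ln c + \sum_{j=1}^{n}\lambda_j f_j(x)\Big)\,dx \\
     &\le -\ln c - \sum_{j=1}^{n}\lambda_j a_j \\
     &= -\int_S p(x)\ln p(x)\,dx = H(p)
\end{aligned}
```

The fourth line uses that q integrates to 1 and satisfies the same expected value conditions as p. Equality holds only when D(q‖p) = 0, that is, when q = p almost everywhere, which is consistent with the uniqueness stated in the theorem.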

Caveats

Note that not all classes of distributions contain a maximum entropy distribution. It is possible that a class contains distributions of arbitrarily large entropy (e.g. the class of all continuous distributions on R with mean 0 but arbitrary standard deviation), or that the entropies are bounded above but there is no distribution which attains the maximal entropy (e.g. the class of all continuous distributions X on R with E(X) = 0 and E(X²) = E(X³) = 1; see Cover, Ch 12). It is also possible that the expected value restrictions for the class C force the probability distribution to be zero in certain subsets of S. In that case the theorem does not apply, but one can work around this by shrinking the set S.
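For the first caveat, the normal distributions with mean 0 already illustrate the problem. A one-line check, using the standard closed form for the differential entropy of a normal density in nats:

```latex
H\big(\mathcal{N}(0,\sigma^{2})\big) = \tfrac{1}{2}\ln\!\big(2\pi e\,\sigma^{2}\big) \longrightarrow \infty
\quad \text{as } \sigma \to \infty
```

Since the entropy grows without bound as the standard deviation increases, the class of mean-0 distributions with unrestricted standard deviation has no maximum entropy member.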

Examples

Every probability distribution is trivially a maximum entropy probability distribution under the constraint that the distribution have its own entropy. To see this, rewrite the density as $p(x)=\exp {(\ln {p(x)})}$ and compare to the expression of the theorem above. By choosing $\ln {p(x)}\rightarrow f(x)$ to be the measurable function and $\int \exp {(f(x))}f(x)dx=-H$ to be the constant, $p(x)$ is the maximum entropy probability distribution under the constraint $\int p(x)f(x)dx=-H$ .

Nontrivial examples are distributions that are subject to multiple constraints that are different from the assignment of the entropy. These are often found by starting with the same procedure $\ln {p(x)}\rightarrow f(x)$ and finding that $f(x)$ can be separated into parts.

A table of examples of maximum entropy distributions is given in Lisman (1972)^[4] and Park & Bera (2009).^[5]

Uniform and piecewise uniform distributions

The uniform distribution on the interval [a,b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b], that is, whose probability density is 0 outside of the interval. This uniform density can be related to Laplace's principle of indifference, sometimes called the principle of insufficient reason. More generally, if we are given a subdivision a=a₀ < a₁ < ... < a_k = b of the interval [a,b] and probabilities p₁,...,p_k which add up to one, then we can consider the class of all continuous distributions such that

\operatorname {Pr} (a_{j-1}\leq X<a_{j})=p_{j}\quad {\mbox{ for }}j=1,\ldots ,k

The density of the maximum entropy distribution for this class is constant on each of the intervals [a_{j-1},a_j). The uniform distribution on the finite set {x₁,...,x_n} (which assigns a probability of 1/n to each of these values) is the maximum entropy distribution among all discrete distributions supported on this set.
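A piecewise constant density that places probability p_j on an interval of width w_j has differential entropy −Σ p_j ln(p_j / w_j). The sketch below evaluates this formula; the function name and the example subdivision are assumptions made for illustration:

```python
import math

def piecewise_uniform_entropy(edges, probs):
    """Differential entropy (nats) of the density probs[j] / width_j
    on the interval [edges[j], edges[j + 1])."""
    if len(edges) != len(probs) + 1:
        raise ValueError("need exactly one probability per interval")
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    h = 0.0
    for left, right, p in zip(edges, edges[1:], probs):
        width = right - left
        if width <= 0:
            raise ValueError("edges must be strictly increasing")
        if p > 0:  # convention: 0 * log 0 = 0
            h -= p * math.log(p / width)
    return h

edges = [0.0, 1.0, 3.0, 4.0]  # hypothetical subdivision of [0, 4]
h_constrained = piecewise_uniform_entropy(edges, [0.5, 0.25, 0.25])
# Probabilities proportional to the widths recover the uniform density on [0, 4].
h_uniform = piecewise_uniform_entropy(edges, [0.25, 0.5, 0.25])
print(h_constrained, h_uniform, math.log(4.0))
```

When the prescribed probabilities are proportional to the interval widths, the result reduces to ln(b − a), the entropy of the uniform distribution on [a,b]; any other choice of probabilities gives a smaller value.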

Positive and specified mean: the exponential distribution

The exponential distribution, for which the density function is

p(x|\lambda )={\begin{cases}\lambda e^{-\lambda x}&x\geq 0,\\0&x<0,\end{cases}}

is the maximum entropy distribution among all continuous distributions supported in [0,∞) that have a specified mean of 1/λ.
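The claim can be spot-checked against other densities on [0,∞) with the same mean. The sketch below uses the standard closed-form differential entropies (in nats) of the exponential, uniform and half-normal distributions; the two comparison distributions and all function names are choices made here for illustration:

```python
import math

def exponential_entropy(mean):
    # Exponential with rate 1/mean: H = 1 - ln(rate) = 1 + ln(mean).
    return 1.0 + math.log(mean)

def uniform_entropy(mean):
    # Uniform on [0, 2 * mean] has the requested mean; H = ln(width).
    return math.log(2.0 * mean)

def half_normal_entropy(mean):
    # Half-normal with scale sigma has mean sigma * sqrt(2 / pi)
    # and entropy 0.5 * ln(pi * e * sigma^2 / 2).
    sigma = mean * math.sqrt(math.pi / 2.0)
    return 0.5 * math.log(math.pi * math.e * sigma ** 2 / 2.0)

mean = 2.0  # hypothetical prescribed mean
for name, fn in [("exponential", exponential_entropy),
                 ("half-normal", half_normal_entropy),
                 ("uniform", uniform_entropy)]:
    print(f"{name:12s} {fn(mean):.4f}")
```

For every positive mean the exponential entropy is the largest of the three; the gaps (1 − ln 2 to the uniform and 1/2 − ln(π/2) to the half-normal) do not depend on the mean.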

Specified variance: the normal distribution

The normal distribution N(μ,σ²), for which the density function is

p(x|\mu ,\sigma )={\frac {1}{\sigma {\sqrt {2\pi }}}}e^{-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}},

has maximum entropy among all real-valued distributions with a specified variance σ² (a particular moment). Therefore, the assumption of normality imposes the minimal prior structural constraint beyond this moment. (See the differential entropy article for a derivation.)
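A small comparison of distributions on the real line with equal variance, using their standard closed-form differential entropies in nats (the Laplace and uniform distributions are comparison cases chosen here, and the function names are illustrative):

```python
import math

def normal_entropy(sigma):
    return 0.5 * math.log(2.0 * math.pi * math.e * sigma ** 2)

def laplace_entropy(sigma):
    # Laplace with scale b has variance 2 * b^2 and entropy 1 + ln(2 * b).
    b = sigma / math.sqrt(2.0)
    return 1.0 + math.log(2.0 * b)

def uniform_entropy(sigma):
    # Uniform of width w has variance w^2 / 12 and entropy ln(w).
    width = sigma * math.sqrt(12.0)
    return math.log(width)

sigma = 1.0  # hypothetical prescribed standard deviation
for name, fn in [("normal", normal_entropy),
                 ("Laplace", laplace_entropy),
                 ("uniform", uniform_entropy)]:
    print(f"{name:8s} {fn(sigma):.4f}")
```

The ordering normal > Laplace > uniform holds for every σ > 0, because each entropy equals ln σ plus a constant that depends only on the family.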

Discrete distributions with specified mean

Among all the discrete distributions supported on the set {x₁,...,x_n} with a specified mean μ, the maximum entropy distribution has the following shape:

\operatorname {Pr} (X=x_{k})=Cr^{x_{k}}\quad {\mbox{ for }}k=1,\ldots ,n

where the positive constants C and r can be determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ.

For example, suppose a large number N of dice are thrown and you are told that the sum of all the shown numbers is S. Based on this information alone, what would be a reasonable assumption for the number of dice showing 1, 2, ..., 6? This is an instance of the situation considered above, with {x₁,...,x₆} = {1,...,6} and μ = S/N.
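The constants in the dice problem generally have no closed form, but they can be found numerically. The sketch below is one possible approach, not a prescribed method: it writes r = exp(t) and finds t by bisection, relying on the fact that the mean of the distribution C·r^x increases with t. The function name and the value 4.5 for S/N are assumptions for illustration.

```python
import math

def maxent_on_support(values, mean):
    """Probabilities proportional to r**x on `values` with the requested mean."""
    if not min(values) < mean < max(values):
        raise ValueError("mean must lie strictly inside the range of the support")

    def probs_for(t):
        # Weights exp(t * x), shifted by the largest exponent for stability.
        exps = [t * x for x in values]
        top = max(exps)
        weights = [math.exp(e - top) for e in exps]
        total = sum(weights)
        return [w / total for w in weights]

    def mean_for(t):
        return sum(p * x for p, x in zip(probs_for(t), values))

    # Expand the bracket until it contains the solution, then bisect.
    lo, hi = -1.0, 1.0
    while mean_for(lo) > mean:
        lo *= 2.0
    while mean_for(hi) < mean:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mean_for(mid) < mean:
            lo = mid
        else:
            hi = mid
    return probs_for(0.5 * (lo + hi))

faces = [1, 2, 3, 4, 5, 6]
dice_probs = maxent_on_support(faces, 4.5)  # hypothetical average S/N = 4.5
for face, p in zip(faces, dice_probs):
    print(face, round(p, 4))
```

For an average of 3.5 the routine returns the uniform distribution (r = 1); for larger averages the probabilities increase geometrically with the face value.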

Finally, among all the discrete distributions supported on the infinite set {x₁,x₂,...} with mean μ, the maximum entropy distribution has the shape:

\operatorname {Pr} (X=x_{k})=Cr^{x_{k}}\quad {\mbox{ for }}k=1,2,\ldots ,

where again the constants C and r are determined by the requirements that the sum of all the probabilities must be 1 and the expected value must be μ. For example, in the case that x_k = k, this gives

C={\frac {1}{\mu -1}},\quad \quad r={\frac {\mu -1}{\mu }},

so that the respective maximum entropy distribution is the geometric distribution.
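For x_k = k the two requirements read C·r/(1 − r) = 1 and C·r/(1 − r)² = μ, which give r = (μ − 1)/μ and C = 1/(μ − 1) for μ > 1. A quick numerical check with truncated sums (the function name and the value of μ are illustrative assumptions):

```python
def geometric_maxent_constants(mu):
    """Constants C and r of Pr(X = k) = C * r**k on k = 1, 2, ... with mean mu."""
    if mu <= 1.0:
        raise ValueError("a distribution on {1, 2, ...} needs mean > 1")
    r = (mu - 1.0) / mu
    c = 1.0 / (mu - 1.0)
    return c, r

mu = 4.0  # hypothetical prescribed mean
c, r = geometric_maxent_constants(mu)

# Truncate the infinite sums; the neglected tail is of order r**2000.
ks = range(1, 2001)
total = sum(c * r ** k for k in ks)
mean = sum(k * c * r ** k for k in ks)
print(total, mean)
```

With p = 1/μ this is the geometric distribution Pr(X = k) = (1 − p)^(k−1) p, since C·r^k = (1 − r)·r^(k−1).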

Circular random variables

For a continuous random variable $\theta _{i}$ distributed about the unit circle, the von Mises distribution maximizes the entropy when the real and imaginary parts of the first circular moment are specified^[6] or, equivalently, the circular mean and circular variance are specified.

When the mean and variance of the angles $\theta _{i}$ modulo $2\pi$ are specified, the wrapped normal distribution maximizes the entropy.^[6]
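For the von Mises density exp(κ cos(θ − μ)) / (2π I₀(κ)), the first circular moment is E(cos θ) = (I₁(κ)/I₀(κ)) cos μ and E(sin θ) = (I₁(κ)/I₀(κ)) sin μ. The sketch below checks this numerically; the helper names, the power-series evaluation of the modified Bessel functions and the parameter values are all assumptions made for illustration:

```python
import math

def bessel_i(order, kappa, terms=40):
    """Modified Bessel function of the first kind, integer order >= 0,
    evaluated by its power series (adequate for moderate kappa)."""
    return sum((kappa / 2.0) ** (2 * m + order)
               / (math.factorial(m) * math.factorial(m + order))
               for m in range(terms))

def von_mises_first_moment(mu, kappa, n=2000):
    """Return (E[cos theta], E[sin theta]) using the trapezoidal rule
    on an equally spaced periodic grid over [0, 2*pi)."""
    h = 2.0 * math.pi / n
    norm = 2.0 * math.pi * bessel_i(0, kappa)
    e_cos = e_sin = 0.0
    for i in range(n):
        theta = i * h
        weight = math.exp(kappa * math.cos(theta - mu)) / norm * h
        e_cos += weight * math.cos(theta)
        e_sin += weight * math.sin(theta)
    return e_cos, e_sin

mu, kappa = 1.0, 2.0  # hypothetical parameters
e_cos, e_sin = von_mises_first_moment(mu, kappa)
ratio = bessel_i(1, kappa) / bessel_i(0, kappa)
print(e_cos, ratio * math.cos(mu))
print(e_sin, ratio * math.sin(mu))
```

Specifying the pair (E(cos θ), E(sin θ)) therefore fixes both μ (through the direction) and κ (through the length I₁(κ)/I₀(κ)).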

Maximizer for specified mean, variance and skew

There exists an upper bound on the entropy of continuous random variables on $\mathbb {R}$ with a specified mean, variance, and skew. However, there is no distribution which achieves this upper bound, because $p(x)=c\exp {(\lambda _{1}x+\lambda _{2}x^{2}+\lambda _{3}x^{3})}$ is unbounded, and therefore cannot be normalized, except when $\lambda _{3}=0$ (see Cover, chapter 12). Thus, we cannot construct a maximum entropy distribution given these constraints.

However, the maximum entropy is $\epsilon$ -achievable: a distribution's entropy can be arbitrarily close to the upper bound. Start with a normal distribution of the specified mean and variance. To introduce a positive skew, perturb the normal distribution upward by a small amount at a value many $\sigma$ larger than the mean. The skewness, being proportional to the third moment, will be affected more than the lower order moments.

Other examples

In the table below, each listed distribution maximizes the entropy for a particular set of functional constraints listed in the third column, and the constraint that x be included in the support of the probability density, which is listed in the fourth column.^[4] ^[5] Several examples (Bernoulli, geometric, exponential, Laplace, Pareto) listed are trivially true because their associated constraints are equivalent to the assignment of their entropy. They are included anyway because their constraint is related to a common or easily measured quantity. For reference, $\Gamma (x)=\int _{0}^{\infty }e^{-t}t^{x-1}dt$ is the gamma function, $\psi (x)={\frac {d}{dx}}\ln \Gamma (x)={\frac {\Gamma '(x)}{\Gamma (x)}}$ is the digamma function, $B(p,q)={\frac {\Gamma (p)\Gamma (q)}{\Gamma (p+q)}}$ is the beta function, and γ_E is Euler's constant.

Table of probability distributions and corresponding maximum entropy constraints
Distribution Name	Probability density/mass function	Maximum Entropy Constraint	Support
Uniform (discrete)	$f(k)={\frac {1}{b-a+1}}$	None	$\{a,a+1,...,b-1,b\}\,$
Uniform (continuous)	$f(x)={\frac {1}{b-a}}$	None	$[a,b]\,$
Bernoulli	$f(k)=p^{k}(1-p)^{1-k}$	$E(k)=p\,$	$\{0,1\}\,$
Geometric	$f(k)=(1-p)^{k-1}\,p$	$E(k)={\frac {1}{p}}\,$	$\{1,2,3,...\}\,$
Exponential	$f(x)=\lambda \exp \left(-\lambda x\right)$	$E(x)={\frac {1}{\lambda }}\,$	$[0,\infty )\,$
Laplace	$f(x)={\frac {1}{2b}}\exp \left(-{\frac {|x-\mu |}{b}}\right)$	$E(|x-\mu |)=b\,$	$(-\infty ,\infty )\,$
Asymmetric Laplace	$f(x)={\frac {\lambda \,e^{-(x-m)\lambda s\kappa ^{s}}}{\kappa +1/\kappa }}\,(s\!=\!\operatorname {sgn}(x\!-\!m))$	$E((x-m)s\kappa ^{s})=1/\lambda \,$	$(-\infty ,\infty )\,$
Pareto	$f(x)={\frac {\alpha x_{m}^{\alpha }}{x^{\alpha +1}}}$	$E(\ln(x))={\frac {1}{\alpha }}+\ln(x_{m})\,$	$[x_{m},\infty )\,$
Normal	$f(x)={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}\exp \left(-{\frac {(x-\mu )^{2}}{2\sigma ^{2}}}\right)$	$E(x)=\mu ,\,E((x-\mu )^{2})=\sigma ^{2}$	$(-\infty ,\infty )\,$
von Mises	$f(\theta )={\frac {1}{2\pi I_{0}(\kappa )}}\exp {(\kappa \cos {(\theta -\mu )})}$	$E(\cos \theta )={\frac {I_{1}(\kappa )}{I_{0}(\kappa )}}\cos \mu ,\,E(\sin \theta )={\frac {I_{1}(\kappa )}{I_{0}(\kappa )}}\sin \mu$	$[0,2\pi )\,$
Rayleigh	$f(x)={\frac {x}{\sigma ^{2}}}\exp \left(-{\frac {x^{2}}{2\sigma ^{2}}}\right)$	$E(x^{2})=2\sigma ^{2},E(\ln(x))={\frac {\ln(2\sigma ^{2})-\gamma _{E}}{2}}\,$	$[0,\infty )\,$
Beta	$f(x)={\frac {x^{\alpha -1}(1-x)^{\beta -1}}{B(\alpha ,\beta )}}$ for $0\leq x\leq 1$	$E(\ln(x))=\psi (\alpha )-\psi (\alpha +\beta )\,$ $E(\ln(1-x))=\psi (\beta )-\psi (\alpha +\beta )\,$	$[0,1]\,$
Cauchy	$f(x)={\frac {1}{\pi (1+x^{2})}}$	$E(\ln(1+x^{2}))=2\ln 2$	$(-\infty ,\infty )\,$
Chi	$f(x)={\frac {2}{2^{k/2}\Gamma (k/2)}}x^{k-1}\exp \left(-{\frac {x^{2}}{2}}\right)$	$E(x^{2})=k,\,E(\ln(x))={\frac {1}{2}}\left[\psi \left({\frac {k}{2}}\right)\!+\!\ln(2)\right]$	$[0,\infty )\,$
Chi-squared	$f(x)={\frac {1}{2^{k/2}\Gamma (k/2)}}x^{{\frac {k}{2}}\!-\!1}\exp \left(-{\frac {x}{2}}\right)$	$E(x)=k,\,E(\ln(x))=\psi \left({\frac {k}{2}}\right)+\ln(2)$	$[0,\infty )\,$
Erlang	$f(x)={\frac {\lambda ^{k}}{(k-1)!}}x^{k-1}\exp(-\lambda x)$	$E(x)=k/\lambda ,\,E(\ln(x))=\psi (k)-\ln(\lambda )$	$[0,\infty )\,$
Gamma	$f(x)={\frac {x^{k-1}\exp(-{\frac {x}{\theta }})}{\theta ^{k}\Gamma (k)}}$	$E(x)=k\theta ,\,E(\ln(x))=\psi (k)+\ln(\theta )$	$[0,\infty )\,$
Lognormal	$f(x)={\frac {1}{\sigma x{\sqrt {2\pi }}}}\exp \left(-{\frac {(\ln x-\mu )^{2}}{2\sigma ^{2}}}\right)$	$E(\ln(x))=\mu ,E((\ln(x)-\mu )^{2})=\sigma ^{2}\,$	$[0,\infty )\,$
Maxwell–Boltzmann	$f(x)={\frac {1}{a^{3}}}{\sqrt {\frac {2}{\pi }}}\,x^{2}\exp \left(-{\frac {x^{2}}{2a^{2}}}\right)$	$E(x^{2})=3a^{2},\,E(\ln(x))\!=\!1\!+\!\ln \left({\frac {a}{\sqrt {2}}}\right)\!-\!{\frac {\gamma _{E}}{2}}$	$[0,\infty )\,$
Weibull	$f(x)={\frac {k}{\lambda ^{k}}}x^{k-1}\exp \left(-{\frac {x^{k}}{\lambda ^{k}}}\right)$	$E(x^{k})=\lambda ^{k},E(\ln(x))=\ln(\lambda )-{\frac {\gamma _{E}}{k}}\,$	$[0,\infty )\,$
Multivariate normal	$f_{X}({\vec {x}})={\frac {\exp \left(-{\frac {1}{2}}({\vec {x}}-{\vec {\mu }})^{\top }\Sigma ^{-1}({\vec {x}}-{\vec {\mu }})\right)}{(2\pi )^{N/2}\left|\Sigma \right|^{1/2}}}$	$E({\vec {x}})={\vec {\mu }},\,E(({\vec {x}}-{\vec {\mu }})({\vec {x}}-{\vec {\mu }})^{\top })=\Sigma \,$	$\mathbb {R} ^{N}\,$
Binomial	$f(k)={n \choose k}p^{k}(1-p)^{n-k}$	$E(x)=\mu ,f\in {\text{n-generalized binomial distribution}}$ ^[7]	$\{0,\ldots ,n\}$
Poisson	$f(k)={\frac {e^{-\lambda }\lambda ^{k}}{k!}}$	$E(x)=\lambda ,f\in {\infty }{\text{-generalized binomial distribution}}$ ^[7]	$\{0,1,2,\ldots \}$
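Individual rows of the table can be spot-checked by simulation. The sketch below does so for the Weibull row, E(x^k) = λ^k and E(ln x) = ln λ − γ_E/k, using Python's standard `random.weibullvariate(alpha, beta)` (scale alpha, shape beta). The sample size, seed and parameter values are arbitrary choices for illustration, and a Monte Carlo estimate only agrees with the table up to sampling error.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def weibull_moments(scale, shape, n=200_000, seed=0):
    """Monte Carlo estimates of E[x**shape] and E[ln x] for a Weibull variable."""
    rng = random.Random(seed)
    sum_pow = 0.0
    sum_log = 0.0
    for _ in range(n):
        x = rng.weibullvariate(scale, shape)
        sum_pow += x ** shape
        sum_log += math.log(x)
    return sum_pow / n, sum_log / n

scale, shape = 1.5, 2.0  # hypothetical lambda and k
est_pow, est_log = weibull_moments(scale, shape)
print("E[x^k]  estimate:", est_pow, " table:", scale ** shape)
print("E[ln x] estimate:", est_log, " table:", math.log(scale) - EULER_GAMMA / shape)
```

With shape k = 2 and λ = σ√2 the same check covers the Rayleigh row, since the Rayleigh distribution is that special case of the Weibull distribution.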

Notes

1. Williams, D. (2001). Weighing the Odds. Cambridge UP. ISBN 0-521-00618-X (pages 197-199).
2. Bernardo, J. M.; Smith, A. F. M. (2000). Bayesian Theory. Wiley. ISBN 0-471-49464-X (pages 209, 366).
3. O'Hagan, A. (1994). Kendall's Advanced Theory of Statistics, Vol 2B, Bayesian Inference. Edward Arnold. ISBN 0-340-52922-9 (Section 5.40).
4. Lisman, J. H. C.; van Zuylen, M. C. A. (1972). "Note on the generation of most probable frequency distributions". Statistica Neerlandica. 26 (1): 19–23. doi:10.1111/j.1467-9574.1972.tb00152.x.
5. Park, Sung Y.; Bera, Anil K. (2009). "Maximum entropy autoregressive conditional heteroskedasticity model". Journal of Econometrics. Elsevier: 219–230. Retrieved 2011-06-02.
6. Jammalamadaka, S. Rao; SenGupta, A. (2001). Topics in circular statistics. New Jersey: World Scientific. ISBN 981-02-3778-2. Retrieved 2011-05-15.
7. Harremös, Peter (2001). "Binomial and Poisson Distribution as Maximum Entropy Distributions". IEEE Transactions on Information Theory. 47 (5).

References

Cover, T. M.; Thomas, J. A. (2006). "Chapter 12, Maximum Entropy". Elements of Information Theory (2 ed.). Wiley. ISBN 0471241954.
Taneja, I. J. (2001). Generalized Information Measures and Their Applications. Chapter 1.

This article is issued from Wikipedia - version of the 11/3/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.