Likelihood function

The likelihood function (often simply called the likelihood) is the joint probability mass (or probability density) of the observed data viewed as a function of the parameters of a statistical model.^[1]^[2]^[3]

Intuitively, the likelihood function ${\mathcal {L}}(\theta \mid x)$ is the probability of observing data $x$ assuming $\theta$ is the actual parameter.

In maximum likelihood estimation, the arg max (over the parameter $\theta$) of the likelihood function serves as a point estimate for $\theta$, while the Fisher information (often approximated by the negative Hessian matrix of the log-likelihood at the maximum) indicates the estimate's precision.

In contrast, in Bayesian statistics, parameter estimates are derived from the converse of the likelihood, the so-called posterior probability, which is calculated via Bayes' rule.^[4]

Definition

The likelihood function, parameterized by a (possibly multivariate) parameter $\theta$ , is usually defined differently for discrete and continuous probability distributions (a more general definition is discussed below). Given a probability density or mass function

x\mapsto f(x\mid \theta ),

where $x$ is a realization of the random variable $X$ , the likelihood function is

\theta \mapsto f(x\mid \theta ),

often written

{\mathcal {L}}(\theta \mid x).

In other words, when $f(x\mid \theta )$ is viewed as a function of $x$ with $\theta$ fixed, it is a probability density function, and when viewed as a function of $\theta$ with $x$ fixed, it is a likelihood function. In the frequentist paradigm, the notation $f(x\mid \theta )$ is often avoided and instead $f(x;\theta )$ or $f(x,\theta )$ are used to indicate that $\theta$ is regarded as a fixed unknown quantity rather than as a random variable being conditioned on.

The likelihood function does not specify the probability that $\theta$ is the truth, given the observed sample $X=x$. Such an interpretation is a common error, with potentially disastrous consequences (see prosecutor's fallacy).

Discrete probability distribution

Let $X$ be a discrete random variable with probability mass function $p$ depending on a parameter $\theta$ . Then the function

{\mathcal {L}}(\theta \mid x)=p_{\theta }(x)=P_{\theta }(X=x),

considered as a function of $\theta$ , is the likelihood function, given the outcome $x$ of the random variable $X$ . Sometimes the probability of "the value $x$ of $X$ for the parameter value $\theta$ " is written as $P (X = x | θ)$ or $P (X = x; θ)$ . The likelihood is the probability that a particular outcome $x$ is observed when the true value of the parameter is $\theta$ , equivalent to the probability mass on $x$ ; it is not a probability density over the parameter $\theta$ . The likelihood, ${\mathcal {L}}(\theta \mid x)$ , should not be confused with $P(\theta \mid x)$ , which is the posterior probability of $\theta$ given the data $x$ .

Given no event (no data), the likelihood is 1; any non-trivial event will have a lower likelihood.

Example

Figure 1. The likelihood function ( $p_{\text{H}}^{2}$ ) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HH.

Figure 2. The likelihood function ( $p_{\text{H}}^{2}(1-p_{\text{H}})$ ) for the probability of a coin landing heads-up (without prior knowledge of the coin's fairness), given that we have observed HHT.

Consider a simple statistical model of a coin flip: a single parameter $p_{\text{H}}$ that expresses the "fairness" of the coin. The parameter is the probability that a coin lands heads up ("H") when tossed. $p_{\text{H}}$ can take on any value within the range 0.0 to 1.0. For a perfectly fair coin, $p_{\text{H}}=0.5$ .

Imagine flipping a fair coin twice, and observing two heads in two tosses ("HH"). Assuming that each successive coin flip is i.i.d., then the probability of observing HH is

P({\text{HH}}\mid p_{\text{H}}=0.5)=0.5^{2}=0.25.

Equivalently, the likelihood of observing "HH" assuming $p_{\text{H}}=0.5$ is

{\mathcal {L}}(p_{\text{H}}=0.5\mid {\text{HH}})=0.25.

This is not the same as saying that $P(p_{\text{H}}=0.5\mid HH)=0.25$ , a conclusion which could only be reached via Bayes' theorem given knowledge about the marginal probabilities $P(p_{\text{H}}=0.5)$ and $P({\text{HH}})$ .

Now suppose that the coin is not a fair coin, but instead that $p_{\text{H}}=0.3$ . Then the probability of two heads on two flips is

P({\text{HH}}\mid p_{\text{H}}=0.3)=0.3^{2}=0.09.

Hence

{\mathcal {L}}(p_{\text{H}}=0.3\mid {\text{HH}})=0.09.

More generally, for each value of $p_{\text{H}}$ , we can calculate the corresponding likelihood. The result of such calculations is displayed in Figure 1. The integral of ${\mathcal {L}}$ over [0, 1] is 1/3; likelihoods need not integrate or sum to one over the parameter space.
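The coin calculations above can be reproduced numerically. The following is a minimal sketch, not part of the original example; the function name `likelihood` and the midpoint-rule integration are illustrative assumptions.

```python
def likelihood(p_heads, n_heads, n_tails):
    """Likelihood of p_heads for an observed i.i.d. sequence with the given counts."""
    return p_heads ** n_heads * (1.0 - p_heads) ** n_tails

# Values quoted in the text for the observation HH.
l_fair = likelihood(0.5, 2, 0)     # about 0.25
l_biased = likelihood(0.3, 2, 0)   # about 0.09

# Midpoint-rule approximation of the integral of the HH likelihood over [0, 1].
# The exact value is 1/3, so the likelihood is not a density over p_heads.
steps = 10_000
integral = sum(likelihood((i + 0.5) / steps, 2, 0) for i in range(steps)) / steps

print(l_fair, l_biased, integral)
```

Calling `likelihood(p, 2, 1)` instead gives the curve for the observation HHT shown in Figure 2.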

Continuous probability distribution

Let $X$ be a random variable following an absolutely continuous probability distribution with density function $f$ (a function of $x$ ) which depends on a parameter $\theta$ . Then the function

{\mathcal {L}}(\theta \mid x)=f_{\theta }(x),

considered as a function of $\theta$ , is the likelihood function (of $\theta$ , given the outcome $X=x$ ). Again, ${\mathcal {L}}$ is not a probability density or mass function over $\theta$ , despite being a function of $\theta$ given the observation $X=x$ .

Relationship between the likelihood and probability density functions

The use of the probability density in specifying the likelihood function above is justified as follows. Given an observation $x_{j}$ , the likelihood for the interval $[x_{j},x_{j}+h]$ , where $h>0$ is a constant, is given by ${\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])$ . Observe that

\mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h]),

since $h$ is positive and constant. Because

\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}\Pr(x_{j}\leq x\leq x_{j}+h\mid \theta )=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx,

where $f(x\mid \theta )$ is the probability density function, it follows that

\mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])=\mathop {\operatorname {arg\,max} } _{\theta }{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx.

The first fundamental theorem of calculus provides that

\lim _{h\to 0^{+}}{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx=f(x_{j}\mid \theta ).

Then

{\begin{aligned}&\mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x_{j})=\mathop {\operatorname {arg\,max} } _{\theta }\left[\lim _{h\to 0^{+}}{\mathcal {L}}(\theta \mid x\in [x_{j},x_{j}+h])\right]\\[4pt]={}&\mathop {\operatorname {arg\,max} } _{\theta }\left[\lim _{h\to 0^{+}}{\frac {1}{h}}\int _{x_{j}}^{x_{j}+h}f(x\mid \theta )\,dx\right]=\mathop {\operatorname {arg\,max} } _{\theta }f(x_{j}\mid \theta ).\end{aligned}}

Therefore,

\mathop {\operatorname {arg\,max} } _{\theta }{\mathcal {L}}(\theta \mid x_{j})=\mathop {\operatorname {arg\,max} } _{\theta }f(x_{j}\mid \theta ),

and so maximizing the probability density at $x_{j}$ amounts to maximizing the likelihood of the specific observation $x_{j}$.
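The limiting argument can be checked numerically. The sketch below assumes, purely for illustration, a normal model with unit variance and mean $\theta$; the function name and the chosen values of $\theta$, $x_{j}$ and $h$ are hypothetical.

```python
from statistics import NormalDist

def interval_likelihood_per_width(theta, x_j, h):
    """(1/h) * Pr(x_j <= X <= x_j + h | theta) for X ~ Normal(theta, 1)."""
    dist = NormalDist(mu=theta, sigma=1.0)
    return (dist.cdf(x_j + h) - dist.cdf(x_j)) / h

theta, x_j = 0.0, 1.2
density = NormalDist(mu=theta, sigma=1.0).pdf(x_j)

# As h shrinks, the scaled interval probability approaches the density at x_j.
for h in (1.0, 0.1, 0.001):
    print(h, interval_likelihood_per_width(theta, x_j, h), density)
```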

In general

In measure-theoretic probability theory, the density function is defined as the Radon–Nikodym derivative of the probability distribution relative to a common dominating measure.^[5] The likelihood function is this density interpreted as a function of the parameter, rather than the random variable.^[6] Thus, we can construct a likelihood function for any distribution, whether discrete, continuous, a mixture, or otherwise. (Likelihoods are comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.)

The above discussion of the likelihood for discrete random variables uses the counting measure, under which the probability density at any outcome equals the probability of that outcome.

Likelihoods for mixed continuous–discrete distributions

The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses $p_{k}(\theta )$ and a density $f(x\mid \theta )$, where the sum of all the $p$'s added to the integral of $f$ is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. For an observation from the discrete component, the likelihood function is simply

{\mathcal {L}}(\theta \mid x)=p_{k}(\theta ),

where $k$ is the index of the discrete probability mass corresponding to observation $x$, because maximizing the probability mass (or probability) at $x$ amounts to maximizing the likelihood of the specific observation.

The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation $x$ , but not with the parameter $\theta$ .

Regularity conditions

In the context of parameter estimation, the likelihood function is usually assumed to obey certain conditions, known as regularity conditions. These conditions are assumed in various proofs involving likelihood functions, and need to be verified in each particular application. For maximum likelihood estimation, the existence of a global maximum of the likelihood function is of the utmost importance. By the extreme value theorem, it suffices that the likelihood function is continuous on a compact parameter space for the maximum likelihood estimator to exist.^[7] While the continuity assumption is usually met, the compactness assumption about the parameter space is often not, as the bounds of the true parameter values might be unknown. In that case, concavity of the likelihood function plays a key role.

More specifically, if the likelihood function is twice continuously differentiable on the $k$-dimensional parameter space $\Theta$, assumed to be an open connected subset of $\mathbb {R} ^{k}$, there exists a unique maximum ${\hat {\theta }}\in \Theta$ if the matrix of second partials

\mathbf {H} (\theta )\equiv \left[\,{\frac {\partial ^{2}L}{\,\partial \theta _{i}\,\partial \theta _{j}\,}}\,\right]_{i,j=1}^{k}

is negative definite for every $\theta \in \Theta$ at which the gradient $\nabla L\equiv \left[\,{\frac {\partial L}{\,\partial \theta _{i}\,}}\,\right]_{i=1}^{k}$ vanishes, and if the likelihood function approaches a constant on the boundary of the parameter space, $\partial \Theta$, i.e.,

\lim _{\theta \to \partial \Theta }L(\theta )=0,

which may include the points at infinity if $\Theta$ is unbounded. Mäkeläinen and co-authors prove this result using Morse theory while informally appealing to a mountain pass property.^[8] Mascarenhas restates their proof using the mountain pass theorem.^[9]

In the proofs of consistency and asymptotic normality of the maximum likelihood estimator, additional assumptions are made about the probability densities that form the basis of a particular likelihood function. These conditions were first established by Chanda.^[10] In particular, for almost all $x$ , and for all $\,\theta \in \Theta \,,$

{\frac {\partial \log f}{\partial \theta _{r}}}\,,\quad {\frac {\partial ^{2}\log f}{\partial \theta _{r}\partial \theta _{s}}}\,,\quad {\frac {\partial ^{3}\log f}{\partial \theta _{r}\,\partial \theta _{s}\,\partial \theta _{t}}}\,

exist for all $r,s,t=1,2,\ldots ,k$ in order to ensure the existence of a Taylor expansion. Second, for almost all $x$ and for every $\theta \in \Theta$ it must be that

\left|{\frac {\partial f}{\partial \theta _{r}}}\right|<F_{r}(x)\,,\quad \left|{\frac {\partial ^{2}f}{\partial \theta _{r}\,\partial \theta _{s}}}\right|<F_{rs}(x)\,,\quad \left|{\frac {\partial ^{3}f}{\partial \theta _{r}\,\partial \theta _{s}\,\partial \theta _{t}}}\right|<H_{rst}(x)

where $H$ is such that $\int _{-\infty }^{\infty }H_{rst}(z)\,\mathrm {d} z\leq M<\infty$. This boundedness of the derivatives is needed to allow for differentiation under the integral sign. Lastly, it is assumed that the information matrix,

\mathbf {I} (\theta )=\int _{-\infty }^{\infty }{\frac {\partial \log f}{\partial \theta _{r}}}\ {\frac {\partial \log f}{\partial \theta _{s}}}\ f\ \mathrm {d} z

is positive definite and $\left|\mathbf {I} (\theta )\right|$ is finite. This ensures that the score has a finite variance.^[11]

The above conditions are sufficient, but not necessary. That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator with the properties mentioned above. Further, in the case of non-independently or non-identically distributed observations, additional properties may need to be assumed.

In Bayesian statistics, almost identical regularity conditions are imposed on the likelihood function in order to prove asymptotic normality of the posterior probability, and therefore to justify a Laplace approximation of the posterior in large samples.^[14]

Likelihood ratio and relative likelihood

Likelihood ratio

A likelihood ratio is the ratio of any two specified likelihoods, frequently written as:

\Lambda (\theta _{1}:\theta _{2}\mid x)={\frac {{\mathcal {L}}(\theta _{1}\mid x)}{{\mathcal {L}}(\theta _{2}\mid x)}}.

The likelihood ratio is central to likelihoodist statistics: the law of likelihood states that the degree to which data (considered as evidence) supports one parameter value versus another is measured by the likelihood ratio.

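In frequentist inference, the likelihood ratio is the basis for a test statistic, the so-called likelihood-ratio test. By the Neyman–Pearson lemma, this is the most powerful test for comparing two simple hypotheses at a given significance level. Numerous other tests can be viewed as likelihood-ratio tests or approximations thereof.^[15] The asymptotic distribution of the log-likelihood ratio, considered as a test statistic, is given by Wilks' theorem.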

The likelihood ratio is also of central importance in Bayesian inference, where it is known as the Bayes factor, and is used in Bayes' rule. Stated in terms of odds, Bayes' rule states that the posterior odds of two alternatives, $A_{1}$ and $A_{2}$, given an event $B$, is the prior odds, times the likelihood ratio. As an equation:

O(A_{1}:A_{2}\mid B)=O(A_{1}:A_{2})\cdot \Lambda (A_{1}:A_{2}\mid B).
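As an illustration, the odds form of Bayes' rule can be applied to the coin example from earlier in this article. This is a minimal sketch; the function names and the even prior odds are assumptions made for the example.

```python
def likelihood_ratio(lik_1, lik_2):
    """Ratio of two specified likelihoods for the same data."""
    return lik_1 / lik_2

def posterior_odds(prior_odds, lik_ratio):
    """Odds form of Bayes' rule: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * lik_ratio

# Data HH; compare p_H = 0.5 against p_H = 0.3.
lr = likelihood_ratio(0.5 ** 2, 0.3 ** 2)   # 0.25 / 0.09

# Hypothetical prior odds of 1:1 between the two parameter values.
odds = posterior_odds(1.0, lr)
posterior_prob_fair = odds / (1.0 + odds)

print(lr, odds, posterior_prob_fair)
```

With even prior odds the posterior odds equal the likelihood ratio, so the data HH favour $p_{\text{H}}=0.5$ over $p_{\text{H}}=0.3$ by a factor of 25/9.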

The likelihood ratio is not directly used in AIC-based statistics. Instead, what is used is the relative likelihood of models (see below).

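In evidence-based medicine, likelihood ratios are used in diagnostic testing to assess the value of performing a diagnostic test.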

Relative likelihood function

Since the actual value of the likelihood function depends on the sample, it is often convenient to work with a standardized measure. Suppose that the maximum likelihood estimate for the parameter $θ$ is ${\hat {\theta }}$. Relative plausibilities of other $θ$ values may be found by comparing the likelihoods of those other values with the likelihood of ${\hat {\theta }}$. The relative likelihood of $θ$ is defined to be^[16]^[17]^[18]^[19]^[20]

R(\theta )={\frac {{\mathcal {L}}(\theta \mid x)}{{\mathcal {L}}({\hat {\theta }}\mid x)}}.

Thus, the relative likelihood is the likelihood ratio (discussed above) with the fixed denominator ${\mathcal {L}}({\hat {\theta }})$. This corresponds to standardizing the likelihood to have a maximum of 1.

Likelihood region

A likelihood region is the set of all values of $θ$ whose relative likelihood is greater than or equal to a given threshold. In terms of percentages, a $p$ % likelihood region for $θ$ is defined to be^[16]^[18]^[21]

\left\{\theta :R(\theta )\geq {\frac {p}{100}}\right\}.

If $θ$ is a single real parameter, a $p$ % likelihood region will usually comprise an interval of real values. If the region does comprise an interval, then it is called a likelihood interval.^[16]^[18]^[22]

Likelihood intervals, and more generally likelihood regions, are used for interval estimation within likelihoodist statistics: they are similar to confidence intervals in frequentist statistics and credible intervals in Bayesian statistics. Likelihood intervals are interpreted directly in terms of relative likelihood, not in terms of coverage probability (frequentism) or posterior probability (Bayesianism).

Given a model, likelihood intervals can be compared to confidence intervals. If $θ$ is a single real parameter, then under certain conditions, a 14.65% likelihood interval (about 1:7 likelihood) for $θ$ will be the same as a 95% confidence interval (19/20 coverage probability).^[16]^[21] In a slightly different formulation suited to the use of log-likelihoods (see Wilks' theorem), the test statistic is twice the difference in log-likelihoods and the probability distribution of the test statistic is approximately a chi-squared distribution with degrees of freedom (df) equal to the difference in df between the two models (therefore, the $e^{-2}$ likelihood interval is the same as the 0.954 confidence interval, assuming the difference in df to be 1).^[21]^[22]
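The 14.65% figure and a likelihood interval can be computed directly. The sketch below is illustrative only: the binomial data (7 heads, 3 tails), the grid search, and the function name are assumptions, and the chi-squared quantile with 1 df is obtained as the square of a standard normal quantile.

```python
import math
from statistics import NormalDist

# 95% quantile of chi-squared with 1 df = (97.5% standard normal quantile)^2.
q = NormalDist().inv_cdf(0.975) ** 2
threshold = math.exp(-q / 2.0)   # approximately 0.1465

def relative_likelihood(p, n_heads, n_tails):
    """R(p) = L(p | data) / L(p_hat | data) for i.i.d. coin flips."""
    p_hat = n_heads / (n_heads + n_tails)   # maximum likelihood estimate
    lik = lambda v: v ** n_heads * (1.0 - v) ** n_tails
    return lik(p) / lik(p_hat)

# Hypothetical data: 7 heads and 3 tails. Scan a grid of candidate values.
grid = [i / 1000 for i in range(1, 1000)]
inside = [p for p in grid if relative_likelihood(p, 7, 3) >= threshold]
interval = (min(inside), max(inside))

print(threshold, interval)
```

The grid search is a simple stand-in for a root finder; it works here because the region is a single interval around the estimate.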

Likelihoods that eliminate nuisance parameters

In many cases, the likelihood is a function of more than one parameter but interest focuses on the estimation of only one, or at most a few of them, with the others being considered as nuisance parameters. Several alternative approaches have been developed to eliminate such nuisance parameters, so that a likelihood can be written as a function of only the parameter (or parameters) of interest: the main approaches are profile, conditional, and marginal likelihoods.^[23]^[24] These approaches are also useful when a high-dimensional likelihood surface needs to be reduced to one or two parameters of interest in order to allow a graph.

Profile likelihood

It is possible to reduce the dimensions by concentrating the likelihood function for a subset of parameters by expressing the nuisance parameters as functions of the parameters of interest and replacing them in the likelihood function.^[25]^[26] In general, for a likelihood function depending on the parameter vector $\mathbf {\theta }$ that can be partitioned into $\mathbf {\theta } =\left(\mathbf {\theta } _{1}:\mathbf {\theta } _{2}\right)$ , and where a correspondence $\mathbf {\hat {\theta }} _{2}=\mathbf {\hat {\theta }} _{2}\left(\mathbf {\theta } _{1}\right)$ can be determined explicitly, concentration reduces computational burden of the original maximization problem.^[27]

For instance, in a linear regression with normally distributed errors, $\mathbf {y} =\mathbf {X} \beta +u$ , the coefficient vector could be partitioned into $\beta =\left[\beta _{1}:\beta _{2}\right]$ (and consequently the design matrix $\mathbf {X} =\left[\mathbf {X} _{1}:\mathbf {X} _{2}\right]$ ). Maximizing with respect to $\beta _{2}$ yields an optimal value function $\beta _{2}(\beta _{1})=\left(\mathbf {X} _{2}^{\mathsf {T}}\mathbf {X} _{2}\right)^{-1}\mathbf {X} _{2}^{\mathsf {T}}\left(\mathbf {y} -\mathbf {X} _{1}\beta _{1}\right)$ . Using this result, the maximum likelihood estimator for $\beta _{1}$ can then be derived as

{\hat {\beta }}_{1}=\left(\mathbf {X} _{1}^{\mathsf {T}}\left(\mathbf {I} -\mathbf {P} _{2}\right)\mathbf {X} _{1}\right)^{-1}\mathbf {X} _{1}^{\mathsf {T}}\left(\mathbf {I} -\mathbf {P} _{2}\right)\mathbf {y}

where $\mathbf {P} _{2}=\mathbf {X} _{2}\left(\mathbf {X} _{2}^{\mathsf {T}}\mathbf {X} _{2}\right)^{-1}\mathbf {X} _{2}^{\mathsf {T}}$ is the projection matrix of $\mathbf {X} _{2}$. This result is known as the Frisch–Waugh–Lovell theorem.

Since graphically the procedure of concentration is equivalent to slicing the likelihood surface along the ridge of values of the nuisance parameter $\beta _{2}$ that maximizes the likelihood function, creating an isometric profile of the likelihood function for a given $\beta _{1}$ , the result of this procedure is also known as profile likelihood.

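The profile likelihood can also be used to compute confidence intervals that often have better small-sample properties than those based on asymptotic standard errors calculated from the full likelihood.^[30]^[31]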
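Concentration can be illustrated with a simpler model than the regression above: a normal sample in which the mean $\mu$ is the parameter of interest and the variance is a nuisance parameter. For fixed $\mu$ the variance that maximizes the likelihood is the mean squared deviation about $\mu$, and substituting it back gives a profile log-likelihood in $\mu$ alone. The data and function name below are hypothetical.

```python
import math

def profile_log_likelihood(mu, data):
    """Normal log-likelihood with the variance replaced by its maximizer for this mu."""
    n = len(data)
    # For fixed mu, the likelihood is maximized over sigma^2 at this value.
    sigma2_hat = sum((x - mu) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2_hat) + 1.0)

data = [4.1, 5.3, 4.8, 6.0, 5.2]                 # hypothetical sample
grid = [4.0 + i / 1000 for i in range(2001)]     # candidate values of mu in [4, 6]
mu_hat = max(grid, key=lambda mu: profile_log_likelihood(mu, data))

print(mu_hat, sum(data) / len(data))
```

The maximizer of the profile log-likelihood coincides, up to the grid resolution, with the sample mean, which is the maximum likelihood estimate of $\mu$ from the full two-parameter likelihood.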

Conditional likelihood

Sometimes it is possible to find a sufficient statistic for the nuisance parameters, and conditioning on this statistic results in a likelihood which does not depend on the nuisance parameters.^[32]

One example occurs in 2×2 tables, where conditioning on all four marginal totals leads to a conditional likelihood based on the non-central hypergeometric distribution. This form of conditioning is also the basis for Fisher's exact test.

Marginal likelihood

Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values. Another example occurs in linear mixed models, where considering a likelihood for the residuals only after fitting the fixed effects leads to residual maximum likelihood estimation of the variance components.

Partial likelihood

A partial likelihood is an adaptation of the full likelihood such that only a part of the parameters (the parameters of interest) occur in it.^[33] It is a key component of the proportional hazards model: using a restriction on the hazard function, the likelihood does not contain the shape of the hazard over time.

Products of likelihoods

The likelihood, given two or more independent events, is the product of the likelihoods of each of the individual events:

\Lambda (A\mid X_{1}\land X_{2})=\Lambda (A\mid X_{1})\cdot \Lambda (A\mid X_{2}).

This follows from the definition of independence in probability: the probability of two independent events both happening, given a model, is the product of their individual probabilities.

This is particularly important when the events are from independent and identically distributed random variables, such as independent observations or sampling with replacement. In such a situation, the likelihood function factors into a product of individual likelihood functions.

The empty product has value 1, which corresponds to the likelihood, given no event, being 1: before any data, the likelihood is always 1. This is similar to a uniform prior in Bayesian statistics, but in likelihoodist statistics this is not an improper prior because likelihoods are not integrated.

Log-likelihood

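The log-likelihood function is the logarithm of the likelihood function, often denoted by a lowercase $l$ or $\ell$, to contrast with the uppercase $L$ or ${\mathcal {L}}$ for the likelihood. Because logarithms are strictly increasing functions, maximizing the likelihood is equivalent to maximizing the log-likelihood. For practical purposes it is often more convenient to work with the log-likelihood in maximum likelihood estimation, in particular since most common probability distributions (notably the exponential family) are only logarithmically concave, and concavity of the objective function plays a key role in the maximization.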
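A further practical reason to work on the log scale is numerical: a product of many probabilities can underflow in floating-point arithmetic, while the corresponding sum of logarithms does not. The following sketch uses a hypothetical sequence of 2000 coin flips; the function names are illustrative.

```python
import math

def likelihood(p, flips):
    """Product of per-flip probabilities (1 = heads, 0 = tails)."""
    result = 1.0
    for f in flips:
        result *= p if f == 1 else 1.0 - p
    return result

def log_likelihood(p, flips):
    """Sum of per-flip log-probabilities."""
    return sum(math.log(p if f == 1 else 1.0 - p) for f in flips)

flips = [1, 0] * 1000   # hypothetical data: 2000 flips, half of them heads

print(likelihood(0.5, flips))       # underflows to 0.0 in double precision
print(log_likelihood(0.5, flips))   # 2000 * log(0.5), easily representable
```

Because the logarithm is strictly increasing, comparing `log_likelihood` values across candidate parameters ranks them exactly as the likelihood would, even where the product itself is no longer representable.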

Given the independence of each event, the overall log-likelihood of intersection equals the sum of the log-likelihoods of the individual events. This is analogous to the fact that the overall log-probability is the sum of the log-probabilities of the individual events. Interpreting negative log-probability as information content or surprisal, the support (log-likelihood) of a model, given an event, is the negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

A logarithm of a likelihood ratio is equal to the difference of the log-likelihoods:

\log {\frac {{\mathcal {L}}(A)}{{\mathcal {L}}(B)}}=\log {\mathcal {L}}(A)-\log {\mathcal {L}}(B)=\ell (A)-\ell (B).

Just as the likelihood, given no event, is 1, the log-likelihood, given no event, is 0, which corresponds to the value of the empty sum: without any data, there is no support for any models.

Graph

The graph of the log-likelihood is called the support curve (in the univariate case).^[36] In the multivariate case, the concept generalizes into a support surface over the parameter space. It has a relation to, but is distinct from, the support of a distribution.

The term was coined by A. W. F. Edwards in the context of statistical hypothesis testing, i.e. whether or not the data "support" one hypothesis (or parameter value) being tested more than any other.

The log-likelihood function being plotted is used in the computation of the score (the gradient of the log-likelihood) and Fisher information (the curvature of the log-likelihood). Thus, the graph has a direct interpretation in the context of maximum likelihood estimation and likelihood-ratio tests.

Likelihood equations

If the log-likelihood function is smooth, its gradient with respect to the parameter, known as the score and written $s_{n}(\theta )\equiv \nabla _{\theta }\ell _{n}(\theta )$, exists and allows for the application of differential calculus. The basic way to maximize a differentiable function is to find the stationary points (the points where the derivative is zero); since the derivative of a sum is just the sum of the derivatives, but the derivative of a product requires the product rule, it is easier to compute the stationary points of the log-likelihood of independent events than for the likelihood of independent events.

The equations defined by the stationary point of the score function serve as estimating equations for the maximum likelihood estimator.

s_{n}(\theta )=\mathbf {0}

In that sense, the maximum likelihood estimator is implicitly defined by the value at $\mathbf {0}$ of the inverse function $s_{n}^{-1}:\mathbb {E} ^{d}\to \Theta$, where $\mathbb {E} ^{d}$ is the $d$-dimensional Euclidean space, and $\Theta$ is the parameter space. Using the inverse function theorem, it can be shown that $s_{n}^{-1}$ is well-defined in an open neighborhood about $\mathbf {0}$ with probability going to one, and ${\hat {\theta }}_{n}=s_{n}^{-1}(\mathbf {0} )$ is a consistent estimate of $\theta$. As a consequence there exists a sequence $\left\{{\hat {\theta }}_{n}\right\}$ such that $s_{n}({\hat {\theta }}_{n})=\mathbf {0}$ asymptotically almost surely, and ${\hat {\theta }}_{n}{\xrightarrow {\text{p}}}\theta _{0}$.^[37] A similar result can be established using Rolle's theorem.^[38]^[39]

The second derivative evaluated at ${\hat {\theta }}$ , known as Fisher information, determines the curvature of the likelihood surface,^[40] and thus indicates the precision of the estimate.^[41]
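When the likelihood equation has no convenient closed form it is commonly solved iteratively. The sketch below applies Newton's method to the score of an exponential model with rate $\lambda$, chosen only because the answer ($1/{\bar {x}}$) is known and can be checked; the data, starting value, and function names are assumptions.

```python
import math

def score(lam, data):
    """Derivative of the exponential log-likelihood n*log(lam) - lam*sum(data)."""
    return len(data) / lam - sum(data)

def score_derivative(lam, data):
    """Second derivative of the log-likelihood with respect to lam."""
    return -len(data) / lam ** 2

def solve_likelihood_equation(data, lam0, tol=1e-12, max_iter=100):
    """Newton iteration for score(lam) = 0, started from lam0."""
    lam = lam0
    for _ in range(max_iter):
        step = score(lam, data) / score_derivative(lam, data)
        lam -= step
        if abs(step) < tol:
            break
    return lam

data = [1.0, 3.0, 2.0, 2.0]   # hypothetical sample with mean 2
lam_hat = solve_likelihood_equation(data, lam0=0.1)

# Negative second derivative at the estimate (observed information)
# and the approximate standard error derived from it.
observed_information = -score_derivative(lam_hat, data)
std_error = 1.0 / math.sqrt(observed_information)

print(lam_hat, std_error)
```

Newton's method is not guaranteed to converge from an arbitrary starting point; for this model a start below the solution suffices, and sharper curvature at the estimate translates into a smaller standard error.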

Exponential families

The log-likelihood is also particularly useful for exponential families of distributions, which include many of the common parametric probability distributions. The probability distribution function (and thus likelihood function) for exponential families contains products of factors involving exponentiation. The logarithm of such a function is a sum of products, again easier to differentiate than the original function.

An exponential family is one whose probability density function is of the form (for some functions, writing $\langle -,-\rangle$ for the inner product):

p(x\mid {\boldsymbol {\theta }})=h(x)\exp {\Big (}\langle {\boldsymbol {\eta }}({\boldsymbol {\theta }}),\mathbf {T} (x)\rangle -A({\boldsymbol {\theta }}){\Big )}.

Each of these terms has an interpretation,[a] but simply switching from probability to likelihood and taking logarithms yields the sum:

\ell ({\boldsymbol {\theta }}\mid x)=\langle {\boldsymbol {\eta }}({\boldsymbol {\theta }}),\mathbf {T} (x)\rangle -A({\boldsymbol {\theta }})+\log h(x).

The ${\boldsymbol {\eta }}({\boldsymbol {\theta }})$ and $h(x)$ each correspond to a change of coordinates, so in these coordinates, the log-likelihood of an exponential family is given by the simple formula:

\ell ({\boldsymbol {\eta }}\mid x)=\langle {\boldsymbol {\eta }},\mathbf {T} (x)\rangle -A({\boldsymbol {\eta }}).

In words, the log-likelihood of an exponential family is the inner product of the natural parameter ${\boldsymbol {\eta }}$ and the sufficient statistic $\mathbf {T} (x)$, minus the normalization factor (log-partition function) $A({\boldsymbol {\eta }})$. Thus for example the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic $T$ and the log-partition function $A$.

Example: the gamma distribution

The gamma distribution is an exponential family with two parameters, $\alpha$ and $\beta$ . The likelihood function is

{\mathcal {L}}(\alpha ,\beta \mid x)={\frac {\beta ^{\alpha }}{\Gamma (\alpha )}}x^{\alpha -1}e^{-\beta x}.

Finding the maximum likelihood estimate of $\beta$ for a single observed value $x$ looks rather daunting. Its logarithm is much simpler to work with:

\log {\mathcal {L}}(\alpha ,\beta \mid x)=\alpha \log \beta -\log \Gamma (\alpha )+(\alpha -1)\log x-\beta x.\,

To maximize the log-likelihood, we first take the partial derivative with respect to $\beta$ :

{\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x)}{\partial \beta }}={\frac {\alpha }{\beta }}-x.

If there are a number of independent observations $x_{1},\ldots ,x_{n}$ , then the joint log-likelihood will be the sum of individual log-likelihoods, and the derivative of this sum will be a sum of derivatives of each individual log-likelihood:

{\begin{aligned}&{\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x_{1},\ldots ,x_{n})}{\partial \beta }}\\={}&{\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x_{1})}{\partial \beta }}+\cdots +{\frac {\partial \log {\mathcal {L}}(\alpha ,\beta \mid x_{n})}{\partial \beta }}={\frac {n\alpha }{\beta }}-\sum _{i=1}^{n}x_{i}.\end{aligned}}

To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for $\beta$ :

{\widehat {\beta }}={\frac {\alpha }{\bar {x}}}.

Here ${\widehat {\beta }}$ denotes the maximum-likelihood estimate, and $\textstyle {\bar {x}}={\frac {1}{n}}\sum _{i=1}^{n}x_{i}$ is the sample mean of the observations.
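The closed-form estimate can be verified numerically. In this sketch the shape $\alpha$ is treated as known and the observations are hypothetical; the code checks that the derivative with respect to $\beta$ vanishes at $\alpha /{\bar {x}}$ and that the joint log-likelihood is lower on either side.

```python
import math

def gamma_log_likelihood(alpha, beta, data):
    """Joint log-likelihood of independent gamma observations (shape alpha, rate beta)."""
    return sum(
        alpha * math.log(beta) - math.lgamma(alpha)
        + (alpha - 1.0) * math.log(x) - beta * x
        for x in data
    )

def beta_score(alpha, beta, data):
    """Partial derivative of the joint log-likelihood with respect to beta."""
    return len(data) * alpha / beta - sum(data)

alpha = 2.0                          # shape treated as known (assumption)
data = [0.8, 1.5, 2.2, 0.9, 1.6]     # hypothetical observations
x_bar = sum(data) / len(data)
beta_hat = alpha / x_bar

print(beta_hat, beta_score(alpha, beta_hat, data))
```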

Background and interpretation

Historical remarks

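The term "likelihood" has been in use in English since at least late Middle English. Its formal use to refer to a specific function in mathematical statistics was proposed by Ronald Fisher, whose 1922 paper introduced the term "method of maximum likelihood". Quoting Fisher: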

[I]n 1922, I proposed the term 'likelihood,' in view of the fact that, with respect to [the parameter], it is not a probability, and does not obey the laws of probability, while at the same time it bears to the problem of rational choice among the possible values of [the parameter] a relation similar to that which probability bears to the problem of predicting events in games of chance. . . . Whereas, however, in relation to psychological judgment, likelihood has some resemblance to probability, the two concepts are wholly distinct. . . ."^[46]

The concept of likelihood should not be confused with probability, as Sir Ronald Fisher emphasized:

I stress this because in spite of the emphasis that I have always laid upon the difference between probability and likelihood there is still a tendency to treat likelihood as though it were a sort of probability. The first result is thus that there are two different measures of rational belief appropriate to different cases. Knowing the population we can express our incomplete knowledge of, or expectation of, the sample in terms of probability; knowing the sample we can express our incomplete knowledge of the population in terms of likelihood.^[47]

Fisher's invention of statistical likelihood was in reaction against an earlier form of reasoning called inverse probability.^[48] His use of the term "likelihood" fixed the meaning of the term within mathematical statistics.

A. W. F. Edwards (1972) established the axiomatic basis for use of the log-likelihood ratio as a measure of relative support for one hypothesis against another. The support function is then the natural logarithm of the likelihood function. Both terms are used in phylogenetics, but were not adopted in a general treatment of the topic of statistical evidence.^[49]

Interpretations under different foundations

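Among statisticians, there is no consensus about what the foundation of statistics should be. There are four main paradigms that have been proposed for the foundation: frequentism, Bayesianism, likelihoodism, and AIC-based.^[50]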

For each of the proposed foundations, the interpretation of likelihood is different. The four interpretations are described in the subsections below.

Frequentist interpretation

Bayesian interpretation

In Bayesian inference, following Bayes' rule, the likelihood, when seen as a conditional density, can be multiplied by the prior probability density of the parameter and then normalized, to give a posterior probability density.^[51]^[52]^[53]^[54]^[55]

More generally, the likelihood of an unknown quantity $X$ given another unknown quantity $Y$ is proportional to the probability of $Y$ given $X$.^[51]^[52]^[53]^[54]^[55]
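The multiply-and-normalize step can be sketched on a grid for the coin example used earlier in this article (observation HH). The uniform prior and the grid resolution are assumptions chosen for illustration.

```python
steps = 1000
grid = [(i + 0.5) / steps for i in range(steps)]   # midpoints of cells in (0, 1)

prior = [1.0 for _ in grid]              # uniform prior density (assumption)
likelihood = [p ** 2 for p in grid]      # likelihood of each p for the data HH

# Multiply likelihood by prior, then normalize so the result integrates to 1.
unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
evidence = sum(unnormalized) / steps     # approximates the normalizing integral
posterior = [u / evidence for u in unnormalized]

posterior_mass = sum(posterior) / steps
posterior_mean = sum(p * d for p, d in zip(grid, posterior)) / steps

print(evidence, posterior_mass, posterior_mean)
```

Under the uniform prior the normalizing constant is the integral of the likelihood (1/3 for HH), so the posterior density is $3p_{\text{H}}^{2}$; unlike the likelihood, it integrates to one.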

Likelihoodist interpretation

In frequentist statistics, the likelihood function is itself a statistic that summarizes a single sample from a population, whose calculated value depends on a choice of several parameters θ₁ ... θ_p, where p is the count of parameters in some already-selected statistical model. The value of the likelihood serves as a figure of merit for the choice used for the parameters, and the parameter set with maximum likelihood is the best choice, given the data available.

The specific calculation of the likelihood is the probability that the observed sample would be assigned, assuming that the model chosen and the values of the several parameters $\theta$ give an accurate approximation of the frequency distribution of the population that the observed sample was drawn from. Heuristically, it makes sense that a good choice of parameters is one that gives the sample actually observed the maximum possible post-hoc probability of having happened. Wilks' theorem quantifies the heuristic rule by showing that twice the difference between the logarithm of the likelihood generated by the estimate's parameter values and the logarithm of the likelihood generated by the population's "true" (but unknown) parameter values is asymptotically $\chi^2$ distributed.
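
In symbols, the statement of Wilks' theorem used here can be sketched as follows. The statistic is conventionally twice the log-likelihood difference. The notation $\hat{\theta}$ for the maximum likelihood estimate and $\theta_0$ for the "true" parameter set is introduced here for illustration, and the result holds only under the usual regularity conditions.

```latex
2\left[\ln \mathcal{L}(\hat{\theta} \mid x) - \ln \mathcal{L}(\theta_0 \mid x)\right]
\;\xrightarrow{\;d\;}\; \chi^2_p
```

Here $p$ is the number of parameters in the model and the arrow denotes convergence in distribution as the sample size grows.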

Each independent sample's maximum likelihood estimate is a separate estimate of the "true" parameter set describing the population sampled. Successive estimates from many independent samples will cluster together with the population's "true" set of parameter values hidden somewhere in their midst. The difference between the logarithm of the maximum likelihood and the logarithms of adjacent parameter sets' likelihoods may be used to draw a confidence region on a plot whose coordinates are the parameters $\theta_1, \ldots, \theta_p$. The region surrounds the maximum-likelihood estimate, and every point (parameter set) within that region has a log-likelihood that falls below the maximum by at most some fixed value. The $\chi^2$ distribution given by Wilks' theorem converts the region's log-likelihood differences into the "confidence" that the population's "true" parameter set lies inside. The art of choosing the fixed log-likelihood difference is to make the confidence acceptably high while keeping the region acceptably small (narrow range of estimates).
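
A one-parameter version of this construction can be sketched in a few lines. The example is a hedged illustration only: it assumes a Bernoulli model with 30 successes in 100 trials (made-up numbers), requires that the number of successes lies strictly between 0 and the number of trials, and approximates the region on a grid. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def log_likelihood(p, successes, n):
    """Bernoulli log-likelihood of p for `successes` out of `n` trials."""
    return successes * math.log(p) + (n - successes) * math.log(1 - p)


def likelihood_interval(successes, n, confidence=0.95, steps=10_000):
    """Grid approximation of {p : 2 * (l(p_hat) - l(p)) <= chi-square cutoff}."""
    p_hat = successes / n  # maximum likelihood estimate
    # A chi-square variable with 1 degree of freedom is a squared standard
    # normal, so its quantile is the square of a two-sided normal quantile.
    cutoff = NormalDist().inv_cdf(0.5 + confidence / 2) ** 2
    top = log_likelihood(p_hat, successes, n)
    grid = [i / steps for i in range(1, steps)]  # open interval (0, 1)
    inside = [p for p in grid
              if 2 * (top - log_likelihood(p, successes, n)) <= cutoff]
    return p_hat, min(inside), max(inside)


p_hat, low, high = likelihood_interval(successes=30, n=100)
print(p_hat, low, high)
```

Raising `confidence` enlarges the cutoff and widens the interval, which is the trade-off between confidence and narrowness described above.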

As more data are observed, instead of being used to make independent estimates, they can be combined with the previous samples to make a single combined sample, and that large sample may be used for a new maximum likelihood estimate. As the size of the combined sample increases, the size of the likelihood region with the same confidence shrinks. Eventually, either the size of the confidence region is very nearly a single point, or the entire population has been sampled; in both cases, the estimated parameter set is essentially the same as the population parameter set.
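
The shrinkage can be made concrete in a case where the region has a closed form. The sketch below assumes a normal model with known standard deviation `sigma` (an assumption made purely for illustration), for which twice the log-likelihood drop from the maximum equals $n(\bar{x}-\mu)^2/\sigma^2$, so the interval is $\bar{x} \pm \sigma\sqrt{c/n}$ for a cutoff $c$.

```python
import math
from statistics import NormalDist


def interval_width(n, sigma=1.0, confidence=0.95):
    """Width of the likelihood interval for a normal mean with known sigma."""
    # Chi-square quantile with 1 degree of freedom, obtained as the square
    # of the matching two-sided standard normal quantile.
    cutoff = NormalDist().inv_cdf(0.5 + confidence / 2) ** 2
    return 2 * sigma * math.sqrt(cutoff / n)


for n in (25, 100, 400, 1600):
    print(n, round(interval_width(n), 3))
```

In this model the width is proportional to $1/\sqrt{n}$, so quadrupling the combined sample size halves the interval at the same confidence.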

AIC-based interpretation

Under the AIC paradigm, likelihood is interpreted within the context of information theory.^[57]^[58]^[59]
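
As a brief sketch of how the likelihood enters this paradigm, the Akaike information criterion is $\mathrm{AIC} = 2k - 2\ln\hat{\mathcal{L}}$, where $k$ is the number of estimated parameters and $\hat{\mathcal{L}}$ is the maximized likelihood. The example below compares two normal models on a small made-up sample. The data, the choice of models, and the function names are all assumptions for illustration.

```python
import math


def normal_max_log_likelihood(data, mean):
    """Normal log-likelihood at the given mean, maximized over the variance."""
    n = len(data)
    variance = sum((x - mean) ** 2 for x in data) / n  # ML variance estimate
    return -0.5 * n * (math.log(2 * math.pi * variance) + 1)


def aic(max_log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L_hat)."""
    return 2 * k - 2 * max_log_likelihood


data = [2.1, 1.9, 2.4, 2.0, 1.6, 2.3, 1.8, 2.2]  # made-up illustrative sample
sample_mean = sum(data) / len(data)

# Model A: mean fixed at 0, variance estimated (k = 1).
aic_fixed = aic(normal_max_log_likelihood(data, 0.0), k=1)
# Model B: mean and variance both estimated (k = 2).
aic_free = aic(normal_max_log_likelihood(data, sample_mean), k=2)
print(aic_fixed, aic_free)
```

The model with the lower AIC is preferred. Here the extra parameter in model B is penalized by the $2k$ term but still pays for itself through a much higher maximized likelihood.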

Notes

[42] See Exponential family § Interpretation.

References

[1] ISBN 0-534-24312-6.

[2] ISBN 978-1-4419-0925-1.

[3] ISBN 0-387-98502-6.

[4] ISBN 0-471-98165-6.

[5] John Wiley & Sons. pp. 422–423.

[6] Shao, Jun (2003). Mathematical Statistics (2nd ed.). Springer. §4.4.1.

[7] ISBN 0-521-40551-3.

[8] JSTOR 2240844.

[9] S2CID 15896597.

[10] JSTOR 2333005.

[11] ISBN 0-471-09077-8.

[12] doi:10.1111/j.2517-6161.1979.tb01071.x.

[13] doi:10.1111/j.2517-6161.1985.tb01384.x.

[14] ISBN 0-444-88376-2.

[15] doi:10.1080/00031305.1982.10482817.

[16] Kalbfleisch, J. G. (1985), Probability and Statistical Inference, Springer (§9.3).

[17] ISBN 9780412606502 (§1.4.2).

[18] Sprott, D. A. (2000), Statistical Inference in Science, Springer (chap. 2).

[19] Davison, A. C. (2008), Statistical Models, Cambridge University Press (§4.1.2).

[20] Held, L.; Sabanés Bové, D. S. (2014), Applied Statistical Inference: Likelihood and Bayes, Springer (§2.1).

[21] Rossi, R. J. (2018), Mathematical Statistics, Wiley, p. 267.

[22] Journal of the Royal Statistical Society, Series B, 33 (2): 256–262.

[23] Pawitan, Yudi (2001). In All Likelihood: Statistical Modelling and Inference Using Likelihood. Oxford University Press.

[24] Wen Hsiang Wei. "Generalized Linear Model - course notes". Taichung, Taiwan: Tunghai University. Chapter 5. Retrieved 2017-10-01.

[25] ISBN 978-0-674-00560-0.

[26] ISBN 978-0-19-506011-9.

[27] ISBN 978-0-521-40551-5.

[28] ISBN 0-86094-190-6.

[29] ISBN 978-0-691-12522-0.

[30] ISBN 0-387-90777-7.

[31] JSTOR 2347496.

[32] JSTOR 25049882.

[33] MR 0400509.

[34] ISBN 0-471-82668-5.

[35] Papadopoulos, Alecos (September 25, 2013). "Why we always put log() before the joint pdf when we use MLE (Maximum likelihood Estimation)?". Stack Exchange.

[36] ISBN 0-8018-4443-6.

[37] doi:10.1080/01621459.1977.10479926.

[38] doi:10.1080/01621459.1975.10480321.

[39] doi:10.1080/03610928208828325.

[40] doi:10.1093/biomet/47.1-2.203.

[41] Ward, Michael D.; Ahlquist, John S. (2018). Maximum Likelihood for Social Science: Strategies for Analysis. Cambridge University Press. pp. 25–27.

[43] "likelihood", Shorter Oxford English Dictionary (2007).

[44] JSTOR 2676741.

[45] Fisher, R.A. (1921). "On the 'probable error' of a coefficient of correlation deduced from a small sample". Metron. 1: 3–32.

[46] JSTOR 91208.

[47] Klemens, Ben (2008). Modeling with Data: Tools and Techniques for Scientific Computing. Princeton University Press. p. 329.

[48] doi:10.1017/S0305004100016297.

[49] doi:10.1214/ss/1030037905.

[50] Royall, R. (1997). Statistical Evidence. Chapman & Hall.

[51] North-Holland Publishing.

[52] I. J. Good: Probability and the Weighing of Evidence (Griffin 1950), §6.1.

[53] H. Jeffreys: Theory of Probability (3rd ed., Oxford University Press 1983), §1.22.

[54] E. T. Jaynes: Probability Theory: The Logic of Science (Cambridge University Press 2003), §4.1.

[55] D. V. Lindley: Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 1: Probability (Cambridge University Press 1980), §1.6.

[56] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin: Bayesian Data Analysis (3rd ed., Chapman & Hall/CRC 2014), §1.3.

[57] ISBN 9781118341544.

[58] Akaike, H. (1985). "Prediction and entropy". In Atkinson, A. C.; Fienberg, S. E. (eds.). A Celebration of Statistics. Springer. pp. 1–24.

[59] Sakamoto, Y.; Ishiguro, M.; Kitagawa, G. (1986). Akaike Information Criterion Statistics. D. Reidel. Part I.

[60] Springer-Verlag. chap. 7.

Further reading

Azzalini, Adelchi (1996). "Likelihood". Statistical Inference Based on the Likelihood. Chapman and Hall. pp. 17–50. ISBN 0-412-60650-X.

Boos, Dennis D.; Stefanski, L. A. (2013). "Likelihood Construction and Estimation". Essential Statistical Inference: Theory and Methods. New York: Springer. pp. 27–124. ISBN 978-1-4614-4817-4.

ISBN 0-8018-4443-6.

ISBN 0-521-36697-6.

Richard, Mark; Vecer, Jan (1 February 2021). "Efficiency Testing of Prediction Markets: Martingale Approach, Likelihood Ratio and Bayes Factor Analysis". Risks. 9 (2): 31. doi:10.3390/risks9020031.

Lindsey, J. K. (1996). "Likelihood". Parametric Statistical Inference. Oxford University Press. pp. 69–139. ISBN 0-19-852359-9.

Rohde, Charles A. (2014). Introductory Statistical Inference with the Likelihood Function. Berlin: Springer. ISBN 978-3-319-10460-7.

Royall, Richard (1997). Statistical Evidence: A Likelihood Paradigm. London: Chapman & Hall. ISBN 0-412-04411-0.

ISBN 978-1-316-63682-4.

External links


Likelihood function at Planetmath

"Log-likelihood". Statlect.


Retrieved from "https://en.wikipedia.org/w/index.php?title=Likelihood_function&oldid=1219191465"


