Bayesian inference

Bayesian inference (/ˈbeɪziən/ BAY-zee-ən or /ˈbeɪʒən/ BAY-zhən)^[1] is a method of statistical inference in which Bayes' theorem is used to calculate a probability of a hypothesis, given prior evidence, and update it as more information becomes available. Fundamentally, Bayesian inference uses a prior distribution to estimate posterior probabilities. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Introduction to Bayes' rule

Formal explanation

Contingency table
Hypothesis Evidence	Satisfies hypothesis $H$	Violates hypothesis ⁠ $\neg H$ ⁠	Total
Has evidence $E$	$P(H\|E)\cdot P(E)$ $=P(E\|H)\cdot P(H)$	$P(\neg H\|E)\cdot P(E)$ $=P(E\|\neg H)\cdot P(\neg H)$	⁠ $P(E)$ ⁠
No evidence ⁠ $\neg E$ ⁠	$P(H\|\neg E)\cdot P(\neg E)$ $=P(\neg E\|H)\cdot P(H)$	$P(\neg H\|\neg E)\cdot P(\neg E)$ $=P(\neg E\|\neg H)\cdot P(\neg H)$	$P(\neg E)$ = $1-P(E)$

Total	⁠ $P(H)$ ⁠	$P(\neg H)=1-P(H)$	1

Bayesian inference derives the

consequence of two antecedents: a prior probability and a "likelihood function" derived from a statistical model for the observed data. Bayesian inference computes the posterior probability according to Bayes' theorem

:

P(H\mid E)={\frac {P(E\mid H)\cdot P(H)}{P(E)}},

where

$H$ stands for any hypothesis whose probability may be affected by data (called evidence below). Often there are competing hypotheses, and the task is to determine which is the most probable.
$P(H)$ , the prior probability, is the estimate of the probability of the hypothesis $H$ before the data $E$ , the current evidence, is observed.
$E$ , the evidence, corresponds to new data that were not used in computing the prior probability.
$P(H\mid E)$ , the posterior probability, is the probability of $H$ given $E$ , i.e., after $E$ is observed. This is what we want to know: the probability of a hypothesis given the observed evidence.
$P(E\mid H)$ is the probability of observing $E$ given $H$ and is called the likelihood. As a function of $E$ with $H$ fixed, it indicates the compatibility of the evidence with the given hypothesis. The likelihood function is a function of the evidence, $E$ , while the posterior probability is a function of the hypothesis, $H$ .
$P(E)$ is sometimes termed the marginal likelihood or "model evidence". This factor is the same for all possible hypotheses being considered (as is evident from the fact that the hypothesis $H$ does not appear anywhere in the symbol, unlike for all the other factors) and hence does not factor into determining the relative probabilities of different hypotheses.
$P(E)>0$ (Else one has $0/0$ .)

For different values of $H$ , only the factors $P(H)$ and $P(E\mid H)$ , both in the numerator, affect the value of $P(H\mid E)$ – the posterior probability of a hypothesis is proportional to its prior probability (its inherent likeliness) and the newly acquired likelihood (its compatibility with the new observed evidence).

In cases where $\neg H$ ("not $H$ "), the

logical negation

of

H

, is a valid likelihood, Bayes' rule can be rewritten as follows:

{\begin{aligned}P(H\mid E)&={\frac {P(E\mid H)P(H)}{P(E)}}\\\\&={\frac {P(E\mid H)P(H)}{P(E\mid H)P(H)+P(E\mid \neg H)P(\neg H)}}\\\\&={\frac {1}{1+\left({\frac {1}{P(H)}}-1\right){\frac {P(E\mid \neg H)}{P(E\mid H)}}}}\\\end{aligned}}

because

P(E)=P(E\mid H)P(H)+P(E\mid \neg H)P(\neg H)

and

P(H)+P(\neg H)=1.

This focuses attention on the term

\left({\tfrac {1}{P(H)}}-1\right){\tfrac {P(E\mid \neg H)}{P(E\mid H)}}.

If that term is approximately 1, then the probability of the hypothesis given the evidence,

P(H\mid E)

, is about

{\tfrac {1}{2}}

, about 50% likely - equally likely or not likely. If that term is very small, close to zero, then the probability of the hypothesis, given the evidence,

P(H\mid E)

is close to 1 or the conditional hypothesis is quite likely. If that term is very large, much larger than 1, then the hypothesis, given the evidence, is quite unlikely. If the hypothesis (without consideration of evidence) is unlikely, then

P(H)

is small (but not necessarily astronomically small) and

{\tfrac {1}{P(H)}}

is much larger than 1 and this term can be approximated as

{\tfrac {P(E\mid \neg H)}{P(E\mid H)\cdot P(H)}}

and relevant probabilities can be compared directly to each other.

One quick and easy way to remember the equation would be to use rule of multiplication: $P(E\cap H)=P(E\mid H)P(H)=P(H\mid E)P(E).$

Alternatives to Bayesian updating

Bayesian updating is widely used and computationally convenient. However, it is not the only updating rule that might be considered rational.

Dutch book" arguments did not specify Bayesian updating: they left open the possibility that non-Bayesian updating rules could avoid Dutch books. Hacking wrote:^[2]

"And neither the Dutch book argument nor any other in the personalist arsenal of proofs of the probability axioms entails the dynamic assumption. Not one entails Bayesianism. So the personalist requires the dynamic assumption to be Bayesian. It is true that in consistency a personalist could abandon the Bayesian model of learning from experience. Salt could lose its savour."

Indeed, there are non-Bayesian updating rules that also avoid Dutch books (as discussed in the literature on "

Richard C. Jeffrey's rule, which applies Bayes' rule to the case where the evidence itself is assigned a probability.^[3] The additional hypotheses needed to uniquely require Bayesian updating have been deemed to be substantial, complicated, and unsatisfactory.^[4]

Inference over exclusive and exhaustive possibilities

If evidence is simultaneously used to update belief over a set of exclusive and exhaustive propositions, Bayesian inference may be thought of as acting on this belief distribution as a whole.

General formulation

Suppose a process is generating independent and identically distributed events $E_{n},\ n=1,2,3,\ldots$ , but the probability distribution is unknown. Let the event space $\Omega$ represent the current state of belief for this process. Each model is represented by event $M_{m}$ . The conditional probabilities $P(E_{n}\mid M_{m})$ are specified to define the models. $P(M_{m})$ is the degree of belief in $M_{m}$ . Before the first inference step, $\{P(M_{m})\}$ is a set of initial prior probabilities. These must sum to 1, but are otherwise arbitrary.

Suppose that the process is observed to generate $E\in \{E_{n}\}$ . For each $M\in \{M_{m}\}$ , the prior $P(M)$ is updated to the posterior $P(M\mid E)$ . From Bayes' theorem:^[5]

$P(M\mid E)={\frac {P(E\mid M)}{\sum _{m}{P(E\mid M_{m})P(M_{m})}}}\cdot P(M).$

Upon observation of further evidence, this procedure may be repeated.

Multiple observations

For a sequence of

independent and identically distributed

observations

\mathbf {E} =(e_{1},\dots ,e_{n})

, it can be shown by induction that repeated application of the above is equivalent to

P(M\mid \mathbf {E} )={\frac {P(\mathbf {E} \mid M)}{\sum _{m}{P(\mathbf {E} \mid M_{m})P(M_{m})}}}\cdot P(M),

where

P(\mathbf {E} \mid M)=\prod _{k}{P(e_{k}\mid M)}.

Parametric formulation: motivating the formal description

By parameterizing the space of models, the belief in all models may be updated in a single step. The distribution of belief over the model space may then be thought of as a distribution of belief over the parameter space. The distributions in this section are expressed as continuous, represented by probability densities, as this is the usual situation. The technique is, however, equally applicable to discrete distributions.

Let the vector ${\boldsymbol {\theta }}$ span the parameter space. Let the initial prior distribution over ${\boldsymbol {\theta }}$ be $p({\boldsymbol {\theta }}\mid {\boldsymbol {\alpha }})$ , where ${\boldsymbol {\alpha }}$ is a set of parameters to the prior itself, or hyperparameters. Let $\mathbf {E} =(e_{1},\dots ,e_{n})$ be a sequence of independent and identically distributed event observations, where all $e_{i}$ are distributed as $p(e\mid {\boldsymbol {\theta }})$ for some ${\boldsymbol {\theta }}$ .

posterior distribution

over

{\boldsymbol {\theta }}

:

${\begin{aligned}p({\boldsymbol {\theta }}\mid \mathbf {E} ,{\boldsymbol {\alpha }})&={\frac {p(\mathbf {E} \mid {\boldsymbol {\theta }},{\boldsymbol {\alpha }})}{p(\mathbf {E} \mid {\boldsymbol {\alpha }})}}\cdot p({\boldsymbol {\theta }}\mid {\boldsymbol {\alpha }})\\&={\frac {p(\mathbf {E} \mid {\boldsymbol {\theta }},{\boldsymbol {\alpha }})}{\int p(\mathbf {E} \mid {\boldsymbol {\theta }},{\boldsymbol {\alpha }})p({\boldsymbol {\theta }}\mid {\boldsymbol {\alpha }})\,d{\boldsymbol {\theta }}}}\cdot p({\boldsymbol {\theta }}\mid {\boldsymbol {\alpha }}),\end{aligned}}$ where $p(\mathbf {E} \mid {\boldsymbol {\theta }},{\boldsymbol {\alpha }})=\prod _{k}p(e_{k}\mid {\boldsymbol {\theta }}).$

Formal description of Bayesian inference

Definitions

$x$ , a data point in general. This may in fact be a
vector
of values.
$\theta$ , the parameter of the data point's distribution, i.e., $x\sim p(x\mid \theta )$ . This may be a
vector
of parameters.
$\alpha$ , the hyperparameter of the parameter distribution, i.e., $\theta \sim p(\theta \mid \alpha )$ . This may be a
vector
of hyperparameters.
$\mathbf {X}$ is the sample, a set of $n$ observed data points, i.e., $x_{1},\ldots ,x_{n}$ .
${\tilde {x}}$ , a new data point whose distribution is to be predicted.

Bayesian inference

The
prior distribution
is the distribution of the parameter(s) before any data is observed, i.e. $p(\theta \mid \alpha )$ . The prior distribution might not be easily determined; in such a case, one possibility may be to use the Jeffreys prior to obtain a prior distribution before updating it with newer observations.
The sampling distribution is the distribution of the observed data conditional on its parameters, i.e. $p(\mathbf {X} \mid \theta )$ . This is also termed the likelihood, especially when viewed as a function of the parameter(s), sometimes written $\operatorname {L} (\theta \mid \mathbf {X} )=p(\mathbf {X} \mid \theta )$ .
The marginal likelihood (sometimes also termed the evidence) is the distribution of the observed data marginalized over the parameter(s), i.e. $p(\mathbf {X} \mid \alpha )=\int p(\mathbf {X} \mid \theta )p(\theta \mid \alpha )d\theta .$ It quantifies the agreement between data and expert opinion, in a geometric sense that can be made precise.^[6] If the marginal likelihood is 0 then there is no agreement between the data and expert opinion and Bayes' rule cannot be applied.
The
Bayes' rule
, which forms the heart of Bayesian inference: $p(\theta \mid \mathbf {X} ,\alpha )={\frac {p(\theta ,\mathbf {X} ,\alpha )}{p(\mathbf {X} ,\alpha )}}={\frac {p(\mathbf {X} \mid \theta ,\alpha )p(\theta ,\alpha )}{p(\mathbf {X} \mid \alpha )p(\alpha )}}={\frac {p(\mathbf {X} \mid \theta ,\alpha )p(\theta \mid \alpha )}{p(\mathbf {X} \mid \alpha )}}\propto p(\mathbf {X} \mid \theta ,\alpha )p(\theta \mid \alpha ).$ This is expressed in words as "posterior is proportional to likelihood times prior", or sometimes as "posterior = likelihood times prior, over evidence".
In practice, for almost all complex Bayesian models used in machine learning, the posterior distribution $p(\theta \mid \mathbf {X} ,\alpha )$ is not obtained in a closed form distribution, mainly because the parameter space for $\theta$ can be very high, or the Bayesian model retains certain hierarchical structure formulated from the observations $\mathbf {X}$ and parameter $\theta$ . In such situations, we need to resort to approximation techniques.^[7]
General case: Let $P_{Y}^{x}$ be the conditional distribution of $Y$ given $X=x$ and let $P_{X}$ be the distribution of $X$ . The joint distribution is then $P_{X,Y}(dx,dy)=P_{Y}^{x}(dy)P_{X}(dx)$ . The conditional distribution $P_{X}^{y}$ of $X$ given $Y=y$ is then determined by

$P_{X}^{y}(A)=E(1_{A}(X)|Y=y)$ Existence and uniqueness of the needed conditional expectation is a consequence of the Radon–Nikodym theorem. This was formulated by Kolmogorov in his famous book from 1933. Kolmogorov underlines the importance of conditional probability by writing "I wish to call attention to ... and especially the theory of conditional probabilities and conditional expectations ..." in the Preface.^[8] The Bayes theorem determines the posterior distribution from the prior distribution. Uniqueness requires continuity assumptions.^[9] Bayes' theorem can be generalized to include improper prior distributions such as the uniform distribution on the real line.^[10] Modern Markov chain Monte Carlo methods have boosted the importance of Bayes' theorem including cases with improper priors.^[11]

Bayesian prediction

The posterior predictive distribution is the distribution of a new data point, marginalized over the posterior: $p({\tilde {x}}\mid \mathbf {X} ,\alpha )=\int p({\tilde {x}}\mid \theta )p(\theta \mid \mathbf {X} ,\alpha )d\theta$
The
prior predictive distribution
is the distribution of a new data point, marginalized over the prior: $p({\tilde {x}}\mid \alpha )=\int p({\tilde {x}}\mid \theta )p(\theta \mid \alpha )d\theta$

Bayesian theory calls for the use of the posterior predictive distribution to do

maximum likelihood or maximum a posteriori estimation (MAP)—and then plugging this estimate into the formula for the distribution of a data point. This has the disadvantage that it does not account for any uncertainty in the value of the parameter, and hence will underestimate the variance

of the predictive distribution.

In some instances, frequentist statistics can work around this problem. For example, confidence intervals and prediction intervals in frequentist statistics when constructed from a normal distribution with unknown mean and variance are constructed using a Student's t-distribution. This correctly estimates the variance, due to the facts that (1) the average of normally distributed random variables is also normally distributed, and (2) the predictive distribution of a normally distributed data point with unknown mean and variance, using conjugate or uninformative priors, has a Student's t-distribution. In Bayesian statistics, however, the posterior predictive distribution can always be determined exactly—or at least to an arbitrary level of precision when numerical methods are used.

Both types of predictive distributions have the form of a compound probability distribution (as does the marginal likelihood). In fact, if the prior distribution is a conjugate prior, such that the prior and posterior distributions come from the same family, it can be seen that both prior and posterior predictive distributions also come from the same family of compound distributions. The only difference is that the posterior predictive distribution uses the updated values of the hyperparameters (applying the Bayesian update rules given in the conjugate prior article), while the prior predictive distribution uses the values of the hyperparameters that appear in the prior distribution.

Mathematical properties

Interpretation of factor

${\textstyle {\frac {P(E\mid M)}{P(E)}}>1\Rightarrow P(E\mid M)>P(E)}$ . That is, if the model were true, the evidence would be more likely than is predicted by the current state of belief. The reverse applies for a decrease in belief. If the belief does not change, ${\textstyle {\frac {P(E\mid M)}{P(E)}}=1\Rightarrow P(E\mid M)=P(E)}$ . That is, the evidence is independent of the model. If the model were true, the evidence would be exactly as likely as predicted by the current state of belief.

Cromwell's rule

If $P(M)=0$ then $P(M\mid E)=0$ . If $P(M)=1$ and $P(E)>0$ , then $P(M|E)=1$ . This can be interpreted to mean that hard convictions are insensitive to counter-evidence.

The former follows directly from Bayes' theorem. The latter can be derived by applying the first rule to the event "not $M$ " in place of " $M$ ", yielding "if $1-P(M)=0$ , then $1-P(M\mid E)=0$ ", from which the result immediately follows.

Asymptotic behaviour of posterior

Consider the behaviour of a belief distribution as it is updated a large number of times with

Freedman and Persi Diaconis continued to work on the case of infinite countable probability spaces.^[14]

To summarise, there may be insufficient trials to suppress the effects of the initial choice, and especially for large (but finite) systems the convergence might be very slow.

Conjugate priors

In parameterized form, the prior distribution is often assumed to come from a family of distributions called conjugate priors. The usefulness of a conjugate prior is that the corresponding posterior distribution will be in the same family, and the calculation may be expressed in closed form.

Estimates of parameters and predictions

It is often desired to use a posterior distribution to estimate a parameter or variable. Several methods of Bayesian estimation select measurements of central tendency from the posterior distribution.

For one-dimensional problems, a unique median exists for practical continuous problems. The posterior median is attractive as a robust estimator.^[15]

If there exists a finite mean for the posterior distribution, then the posterior mean is a method of estimation.^[16] ${\tilde {\theta }}=\operatorname {E} [\theta ]=\int \theta \,p(\theta \mid \mathbf {X} ,\alpha )\,d\theta$

Taking a value with the greatest probability defines maximum a posteriori (MAP) estimates:^[17] $\{\theta _{\text{MAP}}\}\subset \arg \max _{\theta }p(\theta \mid \mathbf {X} ,\alpha ).$

There are examples where no maximum is attained, in which case the set of MAP estimates is empty.

There are other methods of estimation that minimize the posterior

statistical decision theory using the sampling distribution ("frequentist statistics").^[18]

The posterior predictive distribution of a new observation ${\tilde {x}}$ (that is independent of previous observations) is determined by^[19] $p({\tilde {x}}|\mathbf {X} ,\alpha )=\int p({\tilde {x}},\theta \mid \mathbf {X} ,\alpha )\,d\theta =\int p({\tilde {x}}\mid \theta )p(\theta \mid \mathbf {X} ,\alpha )\,d\theta .$

Examples

Probability of a hypothesis

Contingency table

Bowl

Cookie
#1
H₁ #2
H₂
Total

Plain, E 30 20 50

Choc, ¬E 10 20 30

Total 40 40 80

P(H₁|E) = 30 / 50 = 0.6

Suppose there are two full bowls of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than a half, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let $H_{1}$ correspond to bowl #1, and $H_{2}$ to bowl #2. It is given that the bowls are identical from Fred's point of view, thus $P(H_{1})=P(H_{2})$ , and the two must add up to 1, so both are equal to 0.5. The event $E$ is the observation of a plain cookie. From the contents of the bowls, we know that $P(E\mid H_{1})=30/40=0.75$ and $P(E\mid H_{2})=20/40=0.5.$ Bayes' formula then yields ${\begin{aligned}P(H_{1}\mid E)&={\frac {P(E\mid H_{1})\,P(H_{1})}{P(E\mid H_{1})\,P(H_{1})\;+\;P(E\mid H_{2})\,P(H_{2})}}\\\\\ &={\frac {0.75\times 0.5}{0.75\times 0.5+0.5\times 0.5}}\\\\\ &=0.6\end{aligned}}$
Before we observed the cookie, the probability we assigned for Fred having chosen bowl #1 was the prior probability, $P(H_{1})$ , which was 0.5. After observing the cookie, we must revise the probability to $P(H_{1}\mid E)$ , which is 0.6.

Making a prediction

Example results for archaeology example. This simulation was generated using c=15.2.

An archaeologist is working at a site thought to be from the medieval period, between the 11th century to the 16th century. However, it is uncertain exactly when in this period the site was inhabited. Fragments of pottery are found, some of which are glazed and some of which are decorated. It is expected that if the site were inhabited during the early medieval period, then 1% of the pottery would be glazed and 50% of its area decorated, whereas if it had been inhabited in the late medieval period then 81% would be glazed and 5% of its area decorated. How confident can the archaeologist be in the date of inhabitation as fragments are unearthed?
The degree of belief in the continuous variable $C$ (century) is to be calculated, with the discrete set of events $\{GD,G{\bar {D}},{\bar {G}}D,{\bar {G}}{\bar {D}}\}$ as evidence. Assuming linear variation of glaze and decoration with time, and that these variables are independent,
$P(E=GD\mid C=c)=(0.01+{\frac {0.81-0.01}{16-11}}(c-11))(0.5-{\frac {0.5-0.05}{16-11}}(c-11))$ $P(E=G{\bar {D}}\mid C=c)=(0.01+{\frac {0.81-0.01}{16-11}}(c-11))(0.5+{\frac {0.5-0.05}{16-11}}(c-11))$ $P(E={\bar {G}}D\mid C=c)=((1-0.01)-{\frac {0.81-0.01}{16-11}}(c-11))(0.5-{\frac {0.5-0.05}{16-11}}(c-11))$ $P(E={\bar {G}}{\bar {D}}\mid C=c)=((1-0.01)-{\frac {0.81-0.01}{16-11}}(c-11))(0.5+{\frac {0.5-0.05}{16-11}}(c-11))$
Assume a uniform prior of ${\textstyle f_{C}(c)=0.2}$ , and that trials are
independent and identically distributed
. When a new fragment of type $e$ is discovered, Bayes' theorem is applied to update the degree of belief for each $c$ : $f_{C}(c\mid E=e)={\frac {P(E=e\mid C=c)}{P(E=e)}}f_{C}(c)={\frac {P(E=e\mid C=c)}{\int _{11}^{16}{P(E=e\mid C=c)f_{C}(c)dc}}}f_{C}(c)$
A computer simulation of the changing belief as 50 fragments are unearthed is shown on the graph. In the simulation, the site was inhabited around 1420, or $c=15.2$ . By calculating the area under the relevant portion of the graph for 50 trials, the archaeologist can say that there is practically no chance the site was inhabited in the 11th and 12th centuries, about 1% chance that it was inhabited during the 13th century, 63% chance during the 14th century and 36% during the 15th century. The Bernstein-von Mises theorem asserts here the asymptotic convergence to the "true" distribution because the probability space corresponding to the discrete set of events $\{GD,G{\bar {D}},{\bar {G}}D,{\bar {G}}{\bar {D}}\}$ is finite (see above section on asymptotic behaviour of the posterior).

In frequentist statistics and decision theory

A
decision-theoretic justification of the use of Bayesian inference was given by Abraham Wald, who proved that every unique Bayesian procedure is admissible. Conversely, every admissible statistical procedure is either a Bayesian procedure or a limit of Bayesian procedures.^[20]

Wald characterized admissible procedures as Bayesian procedures (and limits of Bayesian procedures), making the Bayesian formalism a central technique in such areas of
confidence intervals.^[21]^[22]^[23]
For example:

"Under some conditions, all admissible procedures are either Bayes procedures or limits of Bayes procedures (in various senses). These remarkable results, at least in their original form, are due essentially to Wald. They are useful because the property of being Bayes is easier to analyze than admissibility."[20]

"In decision theory, a quite general method for proving admissibility consists in exhibiting a procedure as a unique Bayes solution."^[24]

"In the first chapters of this work, prior distributions with finite support and the corresponding Bayes procedures were used to establish some of the main theorems relating to the comparison of experiments. Bayes procedures with respect to more general prior distributions have played a very important role in the development of statistics, including its asymptotic theory." "There are many problems where a glance at posterior distributions, for suitable priors, yields immediately interesting information. Also, this technique can hardly be avoided in sequential analysis."^[25]

"A useful fact is that any Bayes decision rule obtained by taking a proper prior over the whole parameter space must be admissible"^[26]

"An important area of investigation in the development of admissibility ideas has been that of conventional sampling-theory procedures, and many interesting results have been obtained."^[27]

Model selection

Main article:
Bayesian model selection

See also: Bayesian information criterion

Bayesian methodology also plays a role in model selection where the aim is to select one model from a set of competing models that represents most closely the underlying process that generated the observed data. In Bayesian model comparison, the model with the highest posterior probability given the data is selected. The posterior probability of a model depends on the evidence, or marginal likelihood, which reflects the probability that the data is generated by the model, and on the prior belief of the model. When two competing models are a priori considered to be equiprobable, the ratio of their posterior probabilities corresponds to the Bayes factor. Since Bayesian model comparison is aimed on selecting the model with the highest posterior probability, this methodology is also referred to as the maximum a posteriori (MAP) selection rule ^[28] or the MAP probability rule.^[29]

Probabilistic programming

Main article: Probabilistic programming

While conceptually simple, Bayesian methods can be mathematically and numerically challenging. Probabilistic programming languages (PPLs) implement functions to easily build Bayesian models together with efficient automatic inference methods. This helps separate the model building from the inference, allowing practitioners to focus on their specific problems and leaving PPLs to handle the computational details for them.^[30]^[31]^[32]

Applications

Statistical data analysis

See the separate Wikipedia entry on Bayesian statistics, specifically the statistical modeling section in that page.

Computer applications

Bayesian inference has applications in artificial intelligence and expert systems. Bayesian inference techniques have been a fundamental part of computerized pattern recognition techniques since the late 1950s.^[33] There is also an ever-growing connection between Bayesian methods and simulation-based Monte Carlo techniques since complex models cannot be processed in closed form by a Bayesian analysis, while a graphical model structure may allow for efficient simulation algorithms like the Gibbs sampling and other Metropolis–Hastings algorithm schemes.^[34] Recently^[when?] Bayesian inference has gained popularity among the phylogenetics community for these reasons; a number of applications allow many demographic and evolutionary parameters to be estimated simultaneously.
As applied to
naïve Bayes classifier
.
Occam's Razor.^[35]^{[unreliable source?]} Solomonoff's universal prior probability of any prefix p of a computable sequence x is the sum of the probabilities of all programs (for a universal computer) that compute something starting with p. Given some p and any computable but unknown probability distribution from which x is sampled, the universal prior and Bayes' theorem can be used to predict the yet unseen parts of x in optimal fashion.^[36]^[37]

Bioinformatics and healthcare applications

Bayesian inference has been applied in different Bioinformatics applications, including differential gene expression analysis.^[38] Bayesian inference is also used in a general cancer risk model, called CIRI (Continuous Individualized Risk Index), where serial measurements are incorporated to update a Bayesian model which is primarily built from prior knowledge.^[39]^[40]

In the courtroom

Main article: Jurimetrics § Bayesian analysis of evidence

Bayesian inference can be used by jurors to coherently accumulate the evidence for and against a defendant, and to see whether, in totality, it meets their personal threshold for "
betting odds are more widely understood than probabilities. Alternatively, a logarithmic approach
, replacing multiplication with addition, might be easier for a jury to handle.

Adding up evidence

If the existence of the crime is not in doubt, only the identity of the culprit, it has been suggested that the prior should be uniform over the qualifying population.^[44] For example, if 1,000 people could have committed the crime, the prior probability of guilt would be 1/1000.
The use of Bayes' theorem by jurors is controversial. In the United Kingdom, a defence
R v Adams
. The jury convicted, but the case went to appeal on the basis that no means of accumulating evidence had been provided for jurors who did not wish to use Bayes' theorem. The Court of Appeal upheld the conviction, but it also gave the opinion that "To introduce Bayes' Theorem, or any similar method, into a criminal trial plunges the jury into inappropriate and unnecessary realms of theory and complexity, deflecting them from their proper task."

Gardner-Medwin

frequentist p-value

). He argues that if the posterior probability of guilt is to be computed by Bayes' theorem, the prior probability of guilt must be known. This will depend on the incidence of the crime, which is an unusual piece of evidence to consider in a criminal trial. Consider the following three propositions:

A – the known facts and testimony could have arisen if the defendant is guilty.

B – the known facts and testimony could have arisen if the defendant is innocent.

C – the defendant is guilty.

Gardner-Medwin argues that the jury should believe both A and not-B in order to convict. A and not-B implies the truth of C, but the reverse is not true. It is possible that B and C are both true, but in this case he argues that a jury should acquit, even though they know that they will be letting some guilty people go free. See also Lindley's paradox.

Bayesian epistemology

Bayesian epistemology is a movement that advocates for Bayesian inference as a means of justifying the rules of inductive logic.

justificationist epistemology, because it presupposes what it attempts to justify. According to this view, a rational interpretation of Bayesian inference would see it merely as a probabilistic version of falsification

, rejecting the belief, commonly held by Bayesians, that high likelihood achieved by a series of Bayesian updates would prove the hypothesis beyond any reasonable doubt, or even with likelihood greater than 0.

Other

The scientific method is sometimes interpreted as an application of Bayesian inference. In this view, Bayes' rule guides (or should guide) the updating of probabilities about hypotheses conditional on new observations or experiments.^[47] The Bayesian inference has also been applied to treat stochastic scheduling problems with incomplete information by Cai et al. (2009).^[48]
Bayesian search theory is used to search for lost objects.
Bayesian inference in phylogeny
Bayesian tool for methylation analysis
Bayesian approaches to brain function investigate the brain as a Bayesian mechanism.
Bayesian inference in ecological studies^[49]^[50]
Bayesian inference is used to estimate parameters in stochastic chemical kinetic models^[51]
Bayesian inference in econophysics for currency or prediction of trend changes in financial quotations^[52]^[53]
Bayesian inference in marketing
Bayesian inference in motor learning
Bayesian inference is used in probabilistic numerics to solve numerical problems

Bayes and Bayesian inference

The problem considered by Bayes in Proposition 9 of his essay, "An Essay Towards Solving a Problem in the Doctrine of Chances", is the posterior distribution for the parameter a (the success rate) of the binomial distribution.^{[citation needed]}

History

The term Bayesian refers to

frequentist statistics.^[56]

In the 20th century, the ideas of Laplace were further developed in two different directions, giving rise to objective and subjective currents in Bayesian practice. In the objective or "non-informative" current, the statistical analysis depends on only the model assumed, the data analyzed,[57] and the method assigning the prior, which differs from one objective Bayesian practitioner to another. In the subjective or "informative" current, the specification of the prior depends on the belief (that is, propositions on which the analysis is prepared to act), which can summarize information from experts, previous studies, etc.

In the 1980s, there was a dramatic growth in research and applications of Bayesian methods, mostly attributed to the discovery of Markov chain Monte Carlo methods, which removed many of the computational problems, and an increasing interest in nonstandard, complex applications.^[58] Despite growth of Bayesian research, most undergraduate teaching is still based on frequentist statistics.^[59] Nonetheless, Bayesian methods are widely accepted and used, such as for example in the field of machine learning.^[60]

References

Citations

^ "Bayesian". Merriam-Webster.com Dictionary. Merriam-Webster.
S2CID 14344339
.

^ "Bayes' Theorem (Stanford Encyclopedia of Philosophy)". Plato.stanford.edu. Retrieved 2014-01-05.

ISBN 0-19-824860-1
.

ISBN 978-1-4398-4095-5
.

S2CID 88521802
.

S2CID 220935477
.

^ Kolmogorov, A.N. (1933) [1956]. Foundations of the Theory of Probability. Chelsea Publishing Company.

ISBN 978-0-471-27824-5
.

S2CID 237736986
.

OCLC 1159112760
.

JSTOR 2238346
.

JSTOR 2238150
.

S2CID 120767108
.

^ Sen, Pranab K.; Keating, J. P.; Mason, R. L. (1993). Pitman's measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM.

ISBN 9780444515391
.

^ "Maximum A Posteriori (MAP) Estimation". www.probabilitycourse.com. Retrieved 2017-06-02.

^ Yu, Angela. "Introduction to Bayesian Decision Theory" (PDF). cogsci.ucsd.edu/. Archived from the original (PDF) on 2013-02-28.

^ Hitchcock, David. "Posterior Predictive Distribution Stat Slide" (PDF). stat.sc.edu.

^ ^a ^b Bickel & Doksum (2001, p. 32)

doi:10.1214/aoms/1177700051
.

doi:10.1214/aoms/1177697822
.

doi:10.1214/aos/1176345877
.

^ Lehmann, Erich (1986). Testing Statistical Hypotheses (Second ed.). (see p. 309 of Chapter 6.7 "Admissibility", and pp. 17–18 of Chapter 1.8 "Complete Classes"

ISBN 978-0-387-96307-5
. (From "Chapter 12 Posterior Distributions and Bayes Solutions", p. 324)

ISBN 978-0-04-121537-3
.

ISBN 978-0-04-121537-3
.)

S2CID 17338979
.

S2CID 104419861
.

^ Bessiere, P., Mazer, E., Ahuactzin, J. M., & Mekhnacha, K. (2013). Bayesian Programming (1 edition) Chapman and Hall/CRC.

^ Daniel Roy (2015). "Probabilistic Programming". probabilistic-programming.org. Archived from the original on 2016-01-10. Retrieved 2020-01-02.

S2CID 216356
.

doi:10.1214/06-BA101
.

ISBN 978-0-387-92297-3
.

S2CID 2499910
.

S2CID 1500830
.

CiteSeerX 10.1.1.186.8268
.

^ Robinson, Mark D & McCarthy, Davis J & Smyth, Gordon K edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics.

^ "CIRI". ciri.stanford.edu. Retrieved 2019-08-11.

PMID 31280963
.

^ Dawid, A. P. and Mortera, J. (1996) "Coherent Analysis of Forensic Identification Evidence". Journal of the Royal Statistical Society, Series B, 58, 425–443.

^ Foreman, L. A.; Smith, A. F. M., and Evett, I. W. (1997). "Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)". Journal of the Royal Statistical Society, Series A, 160, 429–469.

ISBN 978-0-471-96026-3
.

^ Dawid, A. P. (2001) Bayes' Theorem and Weighing Evidence by Juries. Archived 2015-07-01 at the Wayback Machine

Significance
, 2 (1), March 2005.

ISBN 978-0-8126-9197-9
.

^ Howson & Urbach (2005), Jaynes (2003)

doi:10.1287/opre.1080.0660
.

PMID 24640543
.

S2CID 131588159
.

PMID 27429455
.

^ Fornalski, K.W. (2016). "The Tadpole Bayesian Model for Detecting Trend Changes in Financial Quotations" (PDF). R&R Journal of Statistics and Mathematical Sciences. 2 (1): 117–122.

S2CID 11460968
.

^ Stigler, Stephen (1982). "Thomas Bayes's Bayesian Inference". Journal of the Royal Statistical Society. 145 (2): 250–58.

ISBN 9780674403406
.

^
doi:10.1214/06-ba101
.

^ Bernardo, José-Miguel (2005). "Reference analysis". Handbook of statistics. Vol. 25. pp. 17–90.

S2CID 120094454
.

^ Bernardo, José M. (2006). "A Bayesian mathematical statistics primer" (PDF). Icots-7.

ISBN 978-0387310732
.

Sources

Aster, Richard; Borchers, Brian, and Thurber, Clifford (2012). Parameter Estimation and Inverse Problems, Second Edition, Elsevier.
ISBN 978-0123850485

Bickel, Peter J. & Doksum, Kjell A. (2001). Mathematical Statistics, Volume 1: Basic and Selected Topics (Second (updated printing 2007) ed.). Pearson Prentice–Hall.
ISBN 978-0-13-850363-5
.

ISBN 0-471-57428-7

Edwards, Ward (1968). "Conservatism in Human Information Processing". In Kleinmuntz, B. (ed.). Formal Representation of Human Judgment. Wiley.

Edwards, Ward (1982).
S2CID 143452957
. Chapter: Conservatism in Human Information Processing (excerpted)

ISBN 978-0-521-59271-0 (Link to Fragmentary Edition of March 1996
).

ISBN 978-0-8126-9578-6
.

Phillips, L. D.; Edwards, Ward (October 2008). "Chapter 6: Conservatism in a Simple Probability Inference Task (Journal of Experimental Psychology (1966) 72: 346-354)". In Jie W. Weiss; David J. Weiss (eds.). A Science of Decision Making:The Legacy of Ward Edwards. Oxford University Press. p. 536.
ISBN 978-0-19-532298-9
.

Further reading

For a full report on the history of Bayesian statistics and the debates with frequentists approaches, read Vallverdu, Jordi (2016). Bayesians Versus Frequentists A Philosophical Debate on Statistical Reasoning. New York: Springer.
ISBN 978-3-662-48638-2
.

ISBN 978-0-231-55335-3
.

Elementary

The following books are listed in ascending order of probabilistic sophistication:

Stone, JV (2013), "Bayes' Rule: A Tutorial Introduction to Bayesian Analysis", Download first chapter here, Sebtel Press, England.

ISBN 978-1-118-65012-7
.

ISBN 978-0-8126-9578-6
.

Berry, Donald A. (1996). Statistics: A Bayesian Perspective. Duxbury.
ISBN 978-0-534-23476-8
.

ISBN 978-0-201-52488-8
.

Bolstad, William M. (2007) Introduction to Bayesian Statistics: Second Edition, John Wiley
ISBN 0-471-27020-2

Winkler, Robert L (2003). Introduction to Bayesian Inference and Decision (2nd ed.). Probabilistic.
ISBN 978-0-9647938-4-2
. Updated classic textbook. Bayesian theory clearly presented.

Lee, Peter M. Bayesian Statistics: An Introduction. Fourth Edition (2012), John Wiley
ISBN 978-1-1183-3257-3

Carlin, Bradley P. & Louis, Thomas A. (2008). Bayesian Methods for Data Analysis, Third Edition. Boca Raton, FL: Chapman and Hall/CRC.
ISBN 978-1-58488-697-6
.

ISBN 978-1-4398-4095-5
.

Intermediate or advanced

ISBN 978-0-387-96098-2
.

Bernardo, José M.; Smith, Adrian F. M. (1994). Bayesian Theory. Wiley.

ISBN 0-471-68029-X
.

Schervish, Mark J. (1995). Theory of statistics. Springer-Verlag.
ISBN 978-0-387-94546-0
.

Jaynes, E. T. (1998). Probability Theory: The Logic of Science.

O'Hagan, A. and Forster, J. (2003). Kendall's Advanced Theory of Statistics, Volume 2B: Bayesian Inference. Arnold, New York.
ISBN 0-340-52922-9
.

Robert, Christian P (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation (paperback ed.). Springer.
ISBN 978-0-387-71598-8
.

Pearl, Judea. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, San Mateo, CA: Morgan Kaufmann.

Pierre Bessière et al. (2013). "Bayesian Programming". CRC Press.
ISBN 9781439880326

Francisco J. Samaniego (2010). "A Comparison of the Bayesian and Frequentist Approaches to Estimation". Springer. New York,
ISBN 978-1-4419-5940-9

External links

"Bayesian approach to statistical problems", Encyclopedia of Mathematics, EMS Press, 2001 [1994]

Bayesian Statistics from Scholarpedia.

Introduction to Bayesian probability from Queen Mary University of London

Mathematical Notes on Bayesian Statistics and Markov Chain Monte Carlo

Bayesian reading list Archived 2011-06-25 at the Wayback Machine, categorized and annotated by Tom Griffiths

A. Hajek and S. Hartmann: Bayesian Epistemology, in: J. Dancy et al. (eds.), A Companion to Epistemology. Oxford: Blackwell 2010, 93–106.

S. Hartmann and J. Sprenger: Bayesian Epistemology, in: S. Bernecker and D. Pritchard (eds.), Routledge Companion to Epistemology. London: Routledge 2010, 609–620.

Stanford Encyclopedia of Philosophy: "Inductive Logic"

Bayesian Confirmation Theory (PDF)

What is Bayesian Learning?

Data, Uncertainty and Inference — Informal introduction with many examples, ebook (PDF) freely available at causaScientia

v
t
e
Statistics

Outline

Index

Continuous data
Center

Mean
Arithmetic

Arithmetic-Geometric

Contraharmonic

Cubic

Generalized/power

Geometric

Harmonic

Heronian

Heinz

Lehmer

Median

Mode

Dispersion

Average absolute deviation

Coefficient of variation

Interquartile range

Percentile

Range

Standard deviation

Variance

Shape

Central limit theorem

Moments
Kurtosis

L-moments

Skewness

Count data

Index of dispersion

Summary tables

Contingency table

Frequency distribution

Grouped data

Dependence

Partial correlation

Pearson product-moment correlation

Rank correlation
Kendall's τ

Spearman's ρ

Scatter plot

Graphics

Bar chart

Biplot

Box plot

Control chart

Correlogram

Fan chart

Forest plot

Histogram

Pie chart

Q–Q plot

Radar chart

Run chart

Scatter plot

Stem-and-leaf display

Violin plot

Data collection
Study design

Effect size

Missing data

Optimal design

Population

Replication

Sample size determination

Statistic

Statistical power

Survey methodology

Sampling
Cluster

Stratified

Opinion poll

Questionnaire

Standard error

Controlled experiments

Blocking

Factorial experiment

Interaction

Random assignment

Randomized controlled trial

Randomized experiment

Scientific control

Adaptive designs

Adaptive clinical trial

Stochastic approximation

Up-and-down designs

Observational studies

Cohort study

Cross-sectional study

Natural experiment

Quasi-experiment

Statistical inference
Statistical theory

Population

Statistic

Probability distribution

Sampling distribution
Order statistic

Empirical distribution
Density estimation

Statistical model
Model specification

L^p space

Parameter
location

scale

shape

Parametric family
Likelihood (monotone)

Location–scale family

Exponential family

Completeness

Sufficiency

Statistical functional

Bootstrap

U

V

Optimal decision
loss function

Efficiency

Statistical distance
divergence

Asymptotics

Robustness

Frequentist inference
Point estimation

Estimating equations
Maximum likelihood

Method of moments

M-estimator

Minimum distance

Unbiased estimators
Mean-unbiased minimum-variance
Rao–Blackwellization

Lehmann–Scheffé theorem

Median unbiased

Plug-in

Interval estimation

Confidence interval

Pivot

Likelihood interval

Prediction interval

Tolerance interval

Resampling
Bootstrap

Jackknife

Testing hypotheses

1- & 2-tails

Power
Uniformly most powerful test

Permutation test
Randomization test

Multiple comparisons

Parametric tests

Likelihood-ratio

Score/Lagrange multiplier

Wald

Specific tests

Z-test (normal)

Student's t-test

F-test

Goodness of fit

Chi-squared

G-test

Kolmogorov–Smirnov

Anderson–Darling

Lilliefors

Jarque–Bera

Normality (Shapiro–Wilk)

Likelihood-ratio test

Model selection
Cross validation

AIC

BIC

Rank statistics

Sign
Sample median

Signed rank (Wilcoxon)
Hodges–Lehmann estimator

Rank sum (Mann–Whitney)

Nonparametric anova
1-way (Kruskal–Wallis)

2-way (Friedman)

Ordered alternative (Jonckheere–Terpstra)

Van der Waerden test

Bayesian inference

Bayesian probability
prior

posterior

Credible interval

Bayes factor

Bayesian estimator
Maximum posterior estimator

Correlation

Pearson product-moment

Partial correlation

Confounding variable

Coefficient of determination

Regression analysis

Errors and residuals

Regression validation

Mixed effects models

Simultaneous equations models

Multivariate adaptive regression splines (MARS)

Linear regression

Simple linear regression

Ordinary least squares

General linear model

Bayesian regression

Non-standard predictors

Nonlinear regression

Nonparametric

Semiparametric

Isotonic

Robust

Homoscedasticity and Heteroscedasticity

Generalized linear model

Exponential families

Logistic (Bernoulli) / Binomial / Poisson regressions

Partition of variance

Analysis of variance (ANOVA, anova)

Analysis of covariance

Multivariate ANOVA

Degrees of freedom

Categorical / multivariate / time-series / survival analysis
Categorical

Cohen's kappa

Contingency table

Graphical model

Log-linear model

McNemar's test

Cochran–Mantel–Haenszel statistics

Multivariate

Regression

Manova

Principal components

Canonical correlation

Discriminant analysis

Cluster analysis

Classification

Structural equation model
Factor analysis

Multivariate distributions

Elliptical distributions
Normal

Time-series
General

Decomposition

Trend

Stationarity

Seasonal adjustment

Exponential smoothing

Cointegration

Structural break

Granger causality

Specific tests

Dickey–Fuller

Johansen

Q-statistic (Ljung–Box)

Durbin–Watson

Breusch–Godfrey

Time domain

Autocorrelation (ACF)
partial (PACF)

Cross-correlation (XCF)

ARMA model

ARIMA model (Box–Jenkins)

Autoregressive conditional heteroskedasticity (ARCH)

Vector autoregression (VAR)

Frequency domain

Spectral density estimation

Fourier analysis

Least-squares spectral analysis

Wavelet

Whittle likelihood

Survival
Survival function

Kaplan–Meier estimator (product limit)

Proportional hazards models

Accelerated failure time (AFT) model

First hitting time

Hazard function

Nelson–Aalen estimator

Test

Log-rank test

Applications
Biostatistics

Bioinformatics

Clinical trials / studies

Epidemiology

Medical statistics

Engineering statistics

Chemometrics

Methods engineering

Probabilistic design

Process / quality control

Reliability

System identification

Social statistics

Actuarial science

Census

Crime statistics

Demography

Econometrics

Jurimetrics

National accounts

Official statistics

Population statistics

Psychometrics

Spatial statistics

Cartography

Environmental statistics

Geographic information system

Geostatistics

Kriging

Category

Mathematics portal

Commons

WikiProject

Authority control databases: National
Germany
United States
Czech Republic
Israel

Retrieved from "https://en.wikipedia.org/w/index.php?title=Bayesian_inference&oldid=1293484403"

[1] "Bayesian". Merriam-Webster.com Dictionary. Merriam-Webster.

[2] S2CID 14344339
.

[3] "Bayes' Theorem (Stanford Encyclopedia of Philosophy)". Plato.stanford.edu. Retrieved 2014-01-05.

[4] ISBN 0-19-824860-1
.

[5] ISBN 978-1-4398-4095-5
.

[deCarvalho-Geometry-6] S2CID 88521802
.

[Lee-GibbsSampler-7] S2CID 220935477
.

[8] Kolmogorov, A.N. (1933) [1956]. Foundations of the Theory of Probability. Chelsea Publishing Company.

[9] ISBN 978-0-471-27824-5
.

[10] S2CID 237736986
.

[11] OCLC 1159112760
.

[12] JSTOR 2238346
.

[13] JSTOR 2238150
.

[14] S2CID 120767108
.

[15] Sen, Pranab K.; Keating, J. P.; Mason, R. L. (1993). Pitman's measure of closeness: A comparison of statistical estimators. Philadelphia: SIAM.

[16] ISBN 9780444515391
.

[17] "Maximum A Posteriori (MAP) Estimation". www.probabilitycourse.com. Retrieved 2017-06-02.

[18] Yu, Angela. "Introduction to Bayesian Decision Theory" (PDF). cogsci.ucsd.edu/. Archived from the original (PDF) on 2013-02-28.

[19] Hitchcock, David. "Posterior Predictive Distribution Stat Slide" (PDF). stat.sc.edu.

[Bickel_&_Doksum_2001,_page_32-20] Bickel & Doksum (2001, p. 32)

[21] doi:10.1214/aoms/1177700051
.

[22] :10.1214/aoms/1177697822
.

[23] :10.1214/aos/1176345877
.

[24] Lehmann, Erich (1986). Testing Statistical Hypotheses (Second ed.). (see p. 309 of Chapter 6.7 "Admissibility", and pp. 17–18 of Chapter 1.8 "Complete Classes"

[25] ISBN 978-0-387-96307-5
. (From "Chapter 12 Posterior Distributions and Bayes Solutions", p. 324)

[26] ISBN 978-0-04-121537-3
.

[27] ISBN 978-0-04-121537-3
.)

[28] S2CID 17338979
.

[29] S2CID 104419861
.

[30] Bessiere, P., Mazer, E., Ahuactzin, J. M., & Mekhnacha, K. (2013). Bayesian Programming (1 edition) Chapman and Hall/CRC.

[31] Daniel Roy (2015). "Probabilistic Programming". probabilistic-programming.org. Archived from the original on 2016-01-10. Retrieved 2020-01-02.

[32] S2CID 216356
.

[33] :10.1214/06-BA101
.

[34] ISBN 978-0-387-92297-3
.

[35] S2CID 2499910
.

[36] S2CID 1500830
.

[37] CiteSeerX 10.1.1.186.8268
.

[:edgr-38] Robinson, Mark D & McCarthy, Davis J & Smyth, Gordon K edgeR: a Bioconductor package for differential expression analysis of digital gene expression data, Bioinformatics.

[39] "CIRI". ciri.stanford.edu. Retrieved 2019-08-11.

[40] PMID 31280963
.

[41] Dawid, A. P. and Mortera, J. (1996) "Coherent Analysis of Forensic Identification Evidence". Journal of the Royal Statistical Society, Series B, 58, 425–443.

[42] Foreman, L. A.; Smith, A. F. M., and Evett, I. W. (1997). "Bayesian analysis of deoxyribonucleic acid profiling data in forensic identification applications (with discussion)". Journal of the Royal Statistical Society, Series A, 160, 429–469.

[43] ISBN 978-0-471-96026-3
.

[44] Dawid, A. P. (2001) Bayes' Theorem and Weighing Evidence by Juries. Archived 2015-07-01 at the Wayback Machine

[45] Significance
, 2 (1), March 2005.

[46] ISBN 978-0-8126-9197-9
.

[47] Howson & Urbach (2005), Jaynes (2003)

[Cai_et_al._2009-48] :10.1287/opre.1080.0660
.

[49] PMID 24640543
.

[50] S2CID 131588159
.

[51] PMID 27429455
.

[52] Fornalski, K.W. (2016). "The Tadpole Bayesian Model for Detecting Trend Changes in Financial Quotations" (PDF). R&R Journal of Statistics and Mathematical Sciences. 2 (1): 117–122.

[53] S2CID 11460968
.

[54] Stigler, Stephen (1982). "Thomas Bayes's Bayesian Inference". Journal of the Royal Statistical Society. 145 (2): 250–58.

[Stigler1986-55] ISBN 9780674403406
.

[Fienberg2006-56] 
doi:10.1214/06-ba101
.

[Bernardo2005-57] Bernardo, José-Miguel (2005). "Reference analysis". Handbook of statistics. Vol. 25. pp. 17–90.

[Wolpert2004-58] S2CID 120094454
.

[Bernardo2006-59] Bernardo, José M. (2006). "A Bayesian mathematical statistics primer" (PDF). Icots-7.

[Bishop2007-60] ISBN 978-0387310732
.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

[36]

[37]

[38]

[39]

[40]

[44]

[47]

[48]

[49]

[50]

[51]

[52]

[53]

[56]

[58]

[59]

[60]