Ancillary statistic

An ancillary statistic is a function of the sample whose distribution (or whose pmf or pdf) does not depend on the parameters of the model.^[1]^[2]^[3] An ancillary statistic is a pivotal quantity that is also a statistic. Ancillary statistics can be used to construct prediction intervals. They are also used in connection with Basu's theorem to prove independence between statistics.^[4]

This concept was first introduced by Ronald Fisher in the 1920s,^[5] but its formal definition was only provided in 1964 by Debabrata Basu.^[6]^[7]

Examples

Suppose X₁, ..., X_n are independent and identically distributed, and are normally distributed with unknown expected value μ and known variance 1. Let $\overline{X}_{n} = \frac{X_{1} + \cdots + X_{n}}{n}$ be the sample mean.

The following statistical measures of dispersion of the sample

Range: max(X₁, ..., X_n) − min(X₁, ..., X_n)
Interquartile range: Q₃ − Q₁
Sample variance: $\hat{\sigma}^{2} := \frac{\sum_{i=1}^{n} \left(X_{i} - \overline{X}_{n}\right)^{2}}{n}$

are all ancillary statistics, because their sampling distributions do not change as μ changes. Computationally, this is because the μ terms cancel in the formulas: adding a constant to every observation shifts the sample maximum and minimum by the same amount, so their difference is unchanged, and likewise for the other measures. These measures of dispersion do not depend on location.
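The cancellation can be checked numerically. The following is a minimal sketch, assuming the same noise draws are shifted by two arbitrary values of μ; the helper name `dispersion_stats`, the seed, and the sample size are illustrative choices, not part of the source. The two columns printed should agree up to floating-point rounding.

```python
import math
import random
import statistics


def dispersion_stats(sample):
    """Return (range, interquartile range, variance with divisor n)."""
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartile cut points
    return (
        max(sample) - min(sample),
        q3 - q1,
        statistics.pvariance(sample),  # divides by n
    )


rng = random.Random(0)  # fixed seed, chosen arbitrarily
noise = [rng.gauss(0.0, 1.0) for _ in range(25)]  # N(0, 1) draws

# Two samples that differ only in the location parameter mu.
stats_mu0 = dispersion_stats([0.0 + z for z in noise])
stats_mu50 = dispersion_stats([50.0 + z for z in noise])

for a, b in zip(stats_mu0, stats_mu50):
    print(f"{a:.6f}  {b:.6f}")
```

Because every observation moves by the same constant, each statistic is computed from differences that no longer involve μ.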

Conversely, given i.i.d. normal variables with known mean 1 and unknown variance σ², the sample mean $\overline{X}$ is not an ancillary statistic of the variance, as the sampling distribution of the sample mean is N(1, σ²/n), which does depend on σ²: this measure of location (specifically, its standard error) depends on dispersion.^[8]
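A small simulation illustrates this dependence. It is only a sketch: the function name `sample_mean_spread`, the sample size n = 10, the number of replications, and the seed are assumptions made for the illustration. The empirical variance of the sample mean should be close to σ²/n for each σ.

```python
import random
import statistics


def sample_mean_spread(sigma, n=10, reps=2000, seed=1):
    """Empirical variance of the sample mean of n draws from N(1, sigma^2)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(1.0, sigma) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.pvariance(means)


for sigma in (1.0, 3.0):
    # Empirical variance of the sample mean next to the theoretical sigma^2 / n.
    print(sigma, sample_mean_spread(sigma), sigma**2 / 10)
```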

In location-scale families

In a location family of distributions, $(X_{1}-X_{n}, X_{2}-X_{n}, \dots, X_{n-1}-X_{n})$ is an ancillary statistic.

In a scale family of distributions, $({\frac {X_{1}}{X_{n}}},{\frac {X_{2}}{X_{n}}},\dots ,{\frac {X_{n-1}}{X_{n}}})$ is an ancillary statistic.

In a location-scale family of distributions, $({\frac {X_{1}-X_{n}}{S}},{\frac {X_{2}-X_{n}}{S}},\dots ,{\frac {X_{n-1}-X_{n}}{S}})$, where $S^{2}$ is the sample variance, is an ancillary statistic.^[3]^[9]
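The location-scale case can be illustrated by applying the statistic to a sample before and after an affine map x ↦ a + b·x with b > 0. This is a sketch under stated assumptions: the helper name `standardized_differences`, the values of `a` and `b`, and the use of the n − 1 divisor for S² are choices made here, since the source does not fix them.

```python
import math
import random
import statistics


def standardized_differences(x):
    """Return ((x_1 - x_n)/s, ..., (x_{n-1} - x_n)/s), s = sample standard deviation."""
    s = statistics.stdev(x)  # uses the n - 1 divisor (an assumption here)
    return [(xi - x[-1]) / s for xi in x[:-1]]


rng = random.Random(2)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]  # draws from the standard member of the family

a, b = 7.0, 3.0  # hypothetical location and scale parameters (b > 0)
x = [a + b * zi for zi in z]

u_z = standardized_differences(z)
u_x = standardized_differences(x)

# The statistic takes the same value on both samples, up to rounding.
print(all(math.isclose(p, q, rel_tol=1e-9, abs_tol=1e-9) for p, q in zip(u_z, u_x)))
```

The location a cancels in each difference and the scale b cancels in each ratio, so the distribution of the statistic cannot involve either parameter.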

In recovery of information

It turns out that, if $T_{1}$ is a non-sufficient statistic and $T_{2}$ is ancillary, one can sometimes recover all the information about the unknown parameter contained in the entire data by reporting $T_{1}$ while conditioning on the observed value of $T_{2}$. This is known as conditional inference.^[3]

For example, suppose that $X_{1},X_{2}$ are independent and follow the $N(\theta ,1)$ distribution, where $\theta$ is unknown. Even though $X_{1}$ is not sufficient for $\theta$ (its Fisher information is 1, whereas the Fisher information of the complete statistic ${\overline {X}}$ is 2), additionally reporting the ancillary statistic $X_{1}-X_{2}$ yields a joint distribution with Fisher information 2.^[3]
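The figure of 2 can be checked with a short calculation, sketched here on the assumption that $X_{1}$ and $X_{2}$ are independent. The symbol $D$ for the difference and the notation $I(\theta)$ for Fisher information are introduced only for this illustration.

```latex
\begin{aligned}
D &:= X_{1} - X_{2} \sim N(0, 2), \qquad \operatorname{Cov}(X_{1}, D) = 1, \\
X_{1} \mid D = d &\sim N\!\left(\theta + \tfrac{d}{2},\ 1 - \tfrac{1^{2}}{2}\right)
  = N\!\left(\theta + \tfrac{d}{2},\ \tfrac{1}{2}\right), \\
I_{X_{1} \mid D}(\theta) &= \frac{1}{1/2} = 2 = I_{\overline{X}}(\theta).
\end{aligned}
```

The distribution of $D$ does not involve $\theta$, so $D$ contributes no information on its own. The information in the pair $(X_{1}, D)$ is therefore the conditional information 2, the same as in the full sample.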

Ancillary complement

Given a statistic T that is not sufficient, an ancillary complement is a statistic U that is ancillary and such that (T, U) is sufficient.^[2]

Intuitively, an ancillary complement "adds the missing information" (without duplicating any).

The notion is particularly useful if one takes T to be a maximum likelihood estimator, which in general will not be sufficient; then one can ask for an ancillary complement. In this case, Fisher argues that one must condition on an ancillary complement to determine information content: the Fisher information content of T should be computed not from the marginal distribution of T but from the conditional distribution of T given U, which answers the question of how much information T adds. This is not possible in general, as no ancillary complement need exist, and if one exists, it need not be unique, nor does a maximum ancillary complement exist.

Example

In baseball, suppose a scout observes a batter over N at-bats, where N is determined by a random process independent of the batter's ability; say a coin is tossed after each at-bat and the result determines whether the scout will stay to watch the batter's next at-bat. The eventual data are the number N of at-bats and the number X of hits: the data (X, N) are a sufficient statistic. The observed batting average X/N fails to convey all of the information available in the data because it fails to report the number N of at-bats (e.g., a batting average of 0.400, which is very high, based on only five at-bats does not inspire anywhere near as much confidence in the player's ability as a 0.400 average based on 100 at-bats). The number N of at-bats is an ancillary statistic because

It is a part of the observable data (it is a statistic), and
Its probability distribution does not depend on the batter's ability, since it was chosen by a random process independent of the batter's ability.

This ancillary statistic is an ancillary complement to the observed batting average X/N, i.e., the batting average X/N is not a sufficient statistic, in that it conveys less than all of the relevant information in the data, but conjoined with N, it becomes sufficient.
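The example can be sketched in code. Everything named here is an assumption made for illustration: the function names, the fair coin, the ability values 0.2 and 0.4, the seeds, and the use of the usual binomial approximation $\sqrt{p(1-p)/N}$ as a rough measure of precision. Separate random generators drive the coin and the hits, so that N is determined by the coin alone.

```python
import math
import random


def scout_observation(ability, coin_rng, hit_rng):
    """Simulate (X, N): after each at-bat a fair coin decides whether the scout stays."""
    x, n = 0, 0
    while True:
        n += 1
        x += hit_rng.random() < ability  # a hit occurs with probability `ability`
        if coin_rng.random() < 0.5:      # the coin says leave
            return x, n


def standard_error(x, n):
    """Approximate binomial standard error of the batting average x/n."""
    p = x / n
    return math.sqrt(p * (1 - p) / n)


# The same coin tosses give the same N whatever the batter's ability.
_, n_weak = scout_observation(0.2, random.Random(3), random.Random(4))
_, n_strong = scout_observation(0.4, random.Random(3), random.Random(4))
print(n_weak == n_strong)

# The same 0.400 average is far less precise when N is small.
print(standard_error(2, 5), standard_error(40, 100))
```

Reporting N alongside X/N is what lets a reader tell the two 0.400 averages apart.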

Notes

[1] JSTOR 4355624.
[2] JSTOR 24309506.
[3] ISBN 0-8247-0379-0.
[4] ISBN 978-1-4419-5825-9.
[5] ISSN 0305-0041.
[6] JSTOR 25049300.
[7] ISBN 978-0-940600-50-8, retrieved 2023-04-24.
[8] ISSN 0162-1459.
[9] "Ancillary statistics" (PDF).
