Censoring (statistics)

In statistics, censoring is a condition in which the value of a measurement or observation is only partially known.

For example, suppose a study is conducted to measure the impact of a drug on mortality rate. In such a study, it may be known that an individual's age at death is at least 75 years (but may be more). Such a situation could occur if the individual withdrew from the study at age 75, or if the individual is currently alive at the age of 75.

Censoring also occurs when a value occurs outside the range of a measuring instrument. For example, a bathroom scale might only measure up to 140 kg. If a 160-kg individual is weighed using the scale, the observer would only know that the individual's weight is at least 140 kg.

The problem of censored data, in which the observed value of some variable is partially known, is related to the problem of missing data, where the observed value of some variable is unknown.

Censoring should not be confused with the related idea of truncation. With censoring, observations result either in knowing the exact value that applies, or in knowing that the value lies within an interval. With truncation, observations never result in values outside a given range: values in the population outside the range are never seen, or never recorded if they are seen. Note that in statistics, truncation is not the same as rounding.
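The difference can be illustrated with a small Python sketch based on the bathroom scale example. The weights, the function names and the 140 kg limit used here are illustrative assumptions only: a censored sample keeps one record per individual (with out-of-range values replaced by the limit and flagged), whereas a truncated sample drops out-of-range individuals entirely.

```python
def censor_right(values, limit):
    """Right-censor: keep every observation, but record `limit` for values above it.

    Returns a list of (recorded_value, is_censored) pairs.
    """
    return [(min(v, limit), v > limit) for v in values]


def truncate_right(values, limit):
    """Right-truncate: observations above `limit` are never recorded at all."""
    return [v for v in values if v <= limit]


# Hypothetical true weights in kg; two of them exceed the scale's range.
weights_kg = [62.0, 95.5, 160.0, 151.2]
scale_limit_kg = 140.0

censored = censor_right(weights_kg, scale_limit_kg)
truncated = truncate_right(weights_kg, scale_limit_kg)

print(censored)   # four records, two flagged as censored at the limit
print(truncated)  # only the two in-range weights remain
```

With censoring the analyst still knows how many individuals were above the limit; with truncation even that count is lost.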

Types

Left censoring – a data point is below a certain value but it is unknown by how much.
Interval censoring – a data point is somewhere on an interval between two values.
Right censoring – a data point is above a certain value but it is unknown by how much.
Type I censoring occurs if an experiment has a set number of subjects or items and stops the experiment at a predetermined time, at which point any subjects remaining are right-censored.
Type II censoring occurs if an experiment has a set number of subjects or items and stops the experiment when a predetermined number are observed to have failed; the remaining subjects are then right-censored.
Random (or non-informative) censoring is when each subject has a censoring time that is statistically independent of their failure time. The observed value is the minimum of the censoring and failure times; subjects whose failure time is greater than their censoring time are right-censored.
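The three right-censoring schemes in the list above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the exponential lifetimes are simulated only to provide example data. Each function returns pairs of (observed time, whether the failure was actually observed).

```python
import random


def type_i_censor(failure_times, stop_time):
    """Type I: stop at a fixed time; survivors are right-censored at stop_time."""
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def type_ii_censor(failure_times, n_failures):
    """Type II: stop at the n_failures-th failure; the rest are right-censored there."""
    if not 1 <= n_failures <= len(failure_times):
        raise ValueError("n_failures must be between 1 and the number of items")
    stop_time = sorted(failure_times)[n_failures - 1]
    return [(min(t, stop_time), t <= stop_time) for t in failure_times]


def random_censor(failure_times, censoring_times):
    """Random censoring: observe min(failure, censoring) and which one came first."""
    return [(min(t, c), t <= c) for t, c in zip(failure_times, censoring_times)]


# Simulated example data (assumed exponential lifetimes, fixed seed).
rng = random.Random(0)
failures = [rng.expovariate(0.1) for _ in range(10)]
censors = [rng.expovariate(0.05) for _ in range(10)]

print(type_i_censor(failures, stop_time=10.0))
print(type_ii_censor(failures, n_failures=5))
print(random_censor(failures, censors))
```

Under Type I censoring the number of observed failures is random and the stopping time is fixed; under Type II censoring it is the other way round.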

Interval censoring can occur when observing a value requires follow-ups or inspections. Left and right censoring are special cases of interval censoring, with the beginning of the interval at zero or the end at infinity, respectively.

Estimation methods for left-censored data vary, and not every method is applicable to, or the most reliable for, all data sets.^[1]

A common misconception with time-interval data is to classify intervals whose start time is unknown as left-censored. In these cases we have a lower bound on the time interval, so the data are right-censored (even though the missing start point lies to the left of the known interval when viewed on a timeline).

Analysis

Special techniques may be used to handle censored data. Tests with specific failure times are coded as actual failures; censored data are coded for the type of censoring and the known interval or limit. Special software programs (often reliability oriented) can conduct a maximum likelihood estimation for summary statistics, confidence intervals, etc.
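One possible way to code such records is sketched below in Python. The `Observation` class and its field names are hypothetical and do not come from any particular reliability package. It assumes non-negative quantities such as lifetimes, so that left censoring is an interval starting at zero and right censoring is an interval ending at infinity.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """A measurement known only to lie between `lower` and `upper`."""
    lower: float
    upper: float

    @property
    def kind(self):
        """Classify the record by the type of censoring it represents."""
        if self.lower == self.upper:
            return "exact"              # actual failure time is known
        if math.isinf(self.upper):
            return "right-censored"     # only a lower bound is known
        if self.lower == 0.0:
            return "left-censored"      # only an upper bound is known
        return "interval-censored"      # known to lie between two inspections


# Hypothetical test records, in hours.
records = [
    Observation(120.0, 120.0),      # failed at exactly 120 h
    Observation(500.0, math.inf),   # still running when the test stopped at 500 h
    Observation(0.0, 50.0),         # already failed at the first inspection (50 h)
    Observation(100.0, 200.0),      # failed between two inspections
]

for r in records:
    print(r, r.kind)
```

Storing both bounds for every record keeps exact and censored data in one structure, which is convenient when building a likelihood.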

Epidemiology

One of the earliest attempts to analyse a statistical problem involving censored data was Daniel Bernoulli's 1766 analysis of smallpox morbidity and mortality data to demonstrate the efficacy of inoculation.^[2] An early paper to use the Kaplan–Meier estimator for estimating censored costs was Quesenberry et al. (1989).^[3] However, Lin et al.^[4] found this approach to be invalid unless all patients accumulated costs with a common deterministic rate function over time; they proposed an alternative estimation technique known as the Lin estimator.^[5]

Operating life testing

Reliability testing often consists of conducting a test on an item (under specified conditions) to determine the time it takes for a failure to occur.

Sometimes a failure is planned and expected but does not occur because of operator error, equipment malfunction, a test anomaly, etc. The test result is not the desired time-to-failure but can be (and should be) used as a time-to-termination. The use of censored data is unintentional but necessary.
Sometimes engineers plan a test program so that, after a certain time limit or number of failures, all other tests will be terminated. These suspended times are treated as right-censored data. The use of censored data is intentional.

An analysis of the data from replicate tests includes both the times-to-failure for the items that failed and the time-of-test-termination for those that did not fail.

Censored regression

An early model for censored regression, the tobit model, was proposed by James Tobin in 1958.^[6]

Likelihood

The likelihood is the probability or probability density of what was observed, viewed as a function of parameters in an assumed model. To incorporate censored data points in the likelihood, each censored point is represented by the probability of the censoring event as a function of the model parameters, i.e. by a function of the CDF(s) instead of the density or probability mass.

The most general censoring case is interval censoring: $Pr(a<x\leqslant b)=F(b)-F(a)$, where $F(x)$ is the CDF of the probability distribution, and the two special cases are:

left censoring: $Pr(-\infty <x\leqslant b)=F(b)-F(-\infty )=F(b)-0=F(b)=Pr(x\leqslant b)$

right censoring: $Pr(a<x\leqslant \infty )=F(\infty )-F(a)=1-F(a)=1-Pr(x\leqslant a)=Pr(x>a)$

For continuous probability distributions: $Pr(a<x\leqslant b)=Pr(a<x<b)$
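As a numerical illustration of these formulas, the following Python sketch evaluates the three censoring probabilities for an exponential distribution. The choice of distribution, the rate of 0.5 and the function names are assumptions made only for the example.

```python
import math


def exp_cdf(x, rate):
    """CDF of an exponential distribution: F(x) = 1 - exp(-rate * x) for x > 0."""
    if x <= 0:
        return 0.0          # also covers x = -infinity
    if math.isinf(x):
        return 1.0          # F(infinity) = 1
    return 1.0 - math.exp(-rate * x)


def censored_prob(a, b, cdf):
    """Likelihood contribution of a point known only to satisfy a < x <= b."""
    return cdf(b) - cdf(a)


rate = 0.5  # assumed parameter value


def F(x):
    return exp_cdf(x, rate)


interval = censored_prob(1.0, 3.0, F)        # F(3) - F(1)
left = censored_prob(-math.inf, 3.0, F)      # F(3)
right = censored_prob(1.0, math.inf, F)      # 1 - F(1)

print(interval, left, right)
```

Because left and right censoring are handled by the same function with an infinite endpoint, a likelihood routine only needs the interval case.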

Example

Suppose we are interested in survival times, $T_{1},T_{2},...,T_{n}$, but we don't observe $T_{i}$ for all $i$. Instead, we observe $(U_{i},\delta _{i})$, with $U_{i}=T_{i}$ and $\delta _{i}=1$ if $T_{i}$ is actually observed, and $(U_{i},\delta _{i})$, with $U_{i}<T_{i}$ and $\delta _{i}=0$ if all we know is that $T_{i}$ is longer than $U_{i}$.

When $T_{i}>U_{i}$, $U_{i}$ is called the censoring time.^[7]

If the censoring times are all known constants, then the likelihood is

$$L=\prod _{i,\delta _{i}=1}f(u_{i})\prod _{i,\delta _{i}=0}S(u_{i})$$

where $f(u_{i})$ is the probability density function evaluated at $u_{i}$, and $S(u_{i})$ is the probability that $T_{i}$ is greater than $u_{i}$, called the survival function.

This can be simplified by defining the hazard function, the instantaneous force of mortality, as

$$\lambda (u)=f(u)/S(u)$$

so $f(u)=\lambda (u)S(u)$. Then

$$L=\prod _{i}\lambda (u_{i})^{\delta _{i}}S(u_{i}).$$

For the exponential distribution, this becomes even simpler, because the hazard rate, $\lambda$, is constant, and $S(u)=\exp(-\lambda u)$. Then:

$$L(\lambda )=\lambda ^{k}\exp \left(-\lambda \sum u_{i}\right),$$

where $k=\sum \delta _{i}$.

From this we easily compute ${\hat {\lambda }}$, the maximum likelihood estimate (MLE) of $\lambda$, as follows:

$$l(\lambda )=\log(L(\lambda ))=k\log(\lambda )-\lambda \sum u_{i}.$$

Then

$$dl/d\lambda =k/\lambda -\sum u_{i}.$$

We set this to 0 and solve for $\lambda$ to get:

$${\hat {\lambda }}=k/\sum u_{i}.$$

Equivalently, the mean time to failure is:

$$1/{\hat {\lambda }}=\sum u_{i}/k.$$

This differs from the standard MLE for the exponential distribution in that any censored observations are counted only in the numerator.
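The estimate can be computed directly from the pairs $(u_{i},\delta _{i})$. The following Python sketch uses made-up data and hypothetical function names; it also evaluates the log-likelihood near the estimate as a rough check that it is a maximum.

```python
import math


def exponential_mle(observations):
    """Return lambda_hat = k / sum(u_i) for right-censored exponential data.

    `observations` is a sequence of (u_i, delta_i) pairs, where delta_i is 1
    for an observed failure and 0 for a right-censored time.
    """
    observations = list(observations)
    k = sum(delta for _, delta in observations)        # number of observed failures
    total_time = sum(u for u, _ in observations)       # all times, censored or not
    if total_time <= 0:
        raise ValueError("total observed time must be positive")
    return k / total_time


def log_likelihood(rate, observations):
    """l(lambda) = k * log(lambda) - lambda * sum(u_i)."""
    observations = list(observations)
    k = sum(delta for _, delta in observations)
    total_time = sum(u for u, _ in observations)
    return k * math.log(rate) - rate * total_time


# Hypothetical data: two failures (at 2 and 3) and two censored times (5 and 10).
data = [(2.0, 1), (3.0, 1), (5.0, 0), (10.0, 0)]

rate_hat = exponential_mle(data)   # k / sum(u_i) = 2 / 20
mean_ttf = 1.0 / rate_hat          # sum(u_i) / k = 20 / 2

print(rate_hat, mean_ttf)
```

In this made-up data set, averaging only the two observed failure times would give a mean of 2.5, whereas the censored-data estimate of the mean time to failure is 10, because the censored times still contribute to the total time on test.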

References

[1] PMID 20032004.

[2] Bernoulli, D. (1766). "Essai d'une nouvelle analyse de la mortalité causée par la petite vérole". Mem. Math. Phy. Acad. Roy. Sci. Paris, reprinted in Bradley (1971) 21 and Blower (2004).

[3] PMID 2817192.

[4] PMID 9192444.

[5] PMID 22719214.

[6] JSTOR 1907382.

[7] Wikidata Q98961801.

Further reading

Blower, S. (2004), "D. Bernoulli's 'An attempt at a new analysis of the mortality caused by smallpox and of the advantages of inoculation to prevent it'", Reviews of Medical Virology, 14: 275–288.

Bradley, L. (1971). Smallpox Inoculation: An Eighteenth Century Mathematical Controversy. Nottingham. ISBN 0-902031-23-6.

ISBN 047156737X.

Bagdonavicius, V., Kruopis, J., Nikulin, M.S. (2011), "Non-parametric Tests for Censored Data", London, ISTE/WILEY, ISBN 9781848212893.

External links

"Engineering Statistics Handbook", NIST/SEMATECH

