Errors and residuals
In statistics and optimization, errors and residuals are two closely related and easily confused measures of the deviation of an observed value of an element of a statistical sample from its "true value" (not necessarily observable). The error of an observation is the deviation of the observed value from the true value of a quantity of interest (for example, a population mean). The residual is the difference between the observed value and the estimated value of the quantity of interest (for example, a sample mean). The distinction is most important in regression analysis, where the concepts are sometimes called the regression errors and regression residuals and where they lead to the concept of studentized residuals.
Introduction
Suppose there is a series of observations from a univariate distribution and we want to estimate the mean of that distribution (the so-called location model). In this case, the errors are the deviations of the observations from the population mean, while the residuals are the deviations of the observations from the sample mean.
A statistical error (or disturbance) is the amount by which an observation differs from its expected value, the latter being based on the whole population from which the statistical unit was chosen randomly. For example, if the mean height in a population of 21-year-old men is 1.75 meters, and one randomly chosen man is 1.80 meters tall, then the "error" is 0.05 meters; if the randomly chosen man is 1.70 meters tall, then the "error" is −0.05 meters. The expected value, being the mean of the entire population, is typically unobservable, and hence the statistical error cannot be observed either.
A residual (or fitting deviation), on the other hand, is an observable estimate of the unobservable statistical error. Consider the previous example with men's heights and suppose we have a random sample of n people. The sample mean could serve as a good estimator of the population mean. Then we have:
- The difference between the height of each man in the sample and the unobservable population mean is a statistical error, whereas
- The difference between the height of each man in the sample and the observable sample mean is a residual.
Note that, because of the definition of the sample mean, the sum of the residuals within a random sample is necessarily zero, and thus the residuals are necessarily not independent. The statistical errors, on the other hand, are independent, and their sum within the random sample is almost surely not zero.
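The distinction can be illustrated numerically. The sketch below uses made-up parameter values (a "population" mean of 1.75 m, as in the heights example above); in practice the population mean is unobservable, so only the residuals could actually be computed:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.75                                 # population mean height in metres (normally unobservable)
heights = rng.normal(mu, 0.07, size=10)   # random sample of n = 10 men

errors = heights - mu                     # statistical errors: need the true mean
residuals = heights - heights.mean()      # residuals: use the observable sample mean

# The residuals sum to zero by construction; the errors almost surely do not.
print(residuals.sum())
print(errors.sum())
```

Because the sample mean is defined as the value that makes the residuals sum to zero, the n residuals carry only n − 1 independent pieces of information.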
One can standardize statistical errors (especially of a normal distribution) in a z-score (or "standard score"), and standardize residuals in a t-statistic, or more generally studentized residuals.
In univariate distributions
If we assume a normally distributed population with mean μ and standard deviation σ, and choose individuals independently, then we have

\[ X_1, \dots, X_n \sim N(\mu, \sigma^2) \]

and the sample mean

\[ \overline{X} = \frac{X_1 + \cdots + X_n}{n} \]

is a random variable distributed such that:

\[ \overline{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right). \]

The statistical errors are then

\[ e_i = X_i - \mu, \]

with expected values zero, whereas the residuals are

\[ r_i = X_i - \overline{X}. \]

The sum of squares of the statistical errors, divided by σ2, has a chi-squared distribution with n degrees of freedom:

\[ \frac{1}{\sigma^2} \sum_{i=1}^{n} e_i^2 \sim \chi^2_n. \]

However, this quantity is not observable as the population mean is unknown. The sum of squares of the residuals, on the other hand, is observable. The quotient of that sum by σ2 has a chi-squared distribution with only n − 1 degrees of freedom:

\[ \frac{1}{\sigma^2} \sum_{i=1}^{n} r_i^2 \sim \chi^2_{n-1}. \]
This difference between n and n − 1 degrees of freedom results in Bessel's correction for the estimation of the sample variance of a population with unknown mean and unknown variance. No correction is necessary if the population mean is known.
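Because the sum of squared residuals has expectation (n − 1)σ2 rather than nσ2, dividing it by n underestimates the error variance, while dividing by n − 1 does not. A small Monte Carlo simulation, with hypothetical parameter values, illustrates this:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, trials = 0.0, 2.0, 5, 20_000   # assumed values for the simulation

samples = rng.normal(mu, sigma, size=(trials, n))
resid = samples - samples.mean(axis=1, keepdims=True)  # residuals about each sample mean
ss_resid = (resid ** 2).sum(axis=1)                    # sum of squared residuals per sample

# E[sum r_i^2] = (n - 1) * sigma^2, so dividing by n is biased downward.
biased = ss_resid.mean() / n
unbiased = ss_resid.mean() / (n - 1)
print(biased, unbiased)   # unbiased should be close to sigma^2 = 4
```

This is exactly why the usual sample-variance formula divides by n − 1.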
Remark
It is remarkable that the sum of squares of the residuals and the sample mean can be shown to be independent of each other, using, e.g., Basu's theorem. That fact, and the normal and chi-squared distributions given above, form the basis of calculations involving the t-statistic:

\[ T = \frac{\overline{X}_n - \mu_0}{S_n / \sqrt{n}}, \]

where \( \overline{X}_n - \mu_0 \) represents the errors, \( S_n \) represents the sample standard deviation for a sample of size n and unknown σ, and the denominator term \( S_n/\sqrt{n} \) accounts for the standard deviation of the errors according to:[5]

\[ \operatorname{Var}\!\left(\overline{X}_n\right) = \frac{\sigma^2}{n}. \]
The probability distributions of the numerator and the denominator separately depend on the value of the unobservable population standard deviation σ, but σ appears in both the numerator and the denominator and cancels. That is fortunate because it means that even though we do not know σ, we know the probability distribution of this quotient: it has a Student's t-distribution with n − 1 degrees of freedom. We can therefore use this quotient to find a confidence interval for μ. This t-statistic can be interpreted as "the number of standard errors away from the regression line."[6]
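A minimal numerical sketch of such a confidence interval, using made-up height data; the critical value 2.365 is the 97.5th percentile of the t-distribution with 7 degrees of freedom, taken from a standard t-table:

```python
import numpy as np

heights = np.array([1.68, 1.72, 1.75, 1.79, 1.81, 1.74, 1.77, 1.70])  # hypothetical sample
n = heights.size
xbar = heights.mean()
s = heights.std(ddof=1)       # sample standard deviation S_n (divides by n - 1)

t_crit = 2.365                # 97.5th percentile of t with n - 1 = 7 df (from a t-table)
half_width = t_crit * s / np.sqrt(n)
print(f"95% CI for mu: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```

The interval is centered on the sample mean and widens with the sample standard deviation, exactly as the quotient above dictates.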
Regressions
In regression analysis, the distinction between errors and residuals is subtle and important, and leads to the concept of studentized residuals. Given an unobservable function that relates the independent variable to the dependent variable (say, a line), the deviations of the dependent variable observations from this function are the unobservable errors. If one runs a regression on some data, then the deviations of the dependent variable observations from the fitted function are the residuals. If the linear model is applicable, a scatterplot of residuals plotted against the independent variable should be random about zero with no trend. If the data exhibit a trend, the regression model is likely incorrect; for example, the true function may be a quadratic or higher-order polynomial. If the residuals have no trend but "fan out", they exhibit a phenomenon called heteroscedasticity; if their spread is roughly constant, they exhibit homoscedasticity.
However, a terminological difference arises in the expression mean squared error (MSE). The mean squared error of a regression is a number computed from the sum of squares of the computed residuals, not of the unobservable errors. If that sum of squares is divided by n, the number of observations, the result is the mean of the squared residuals. Since this is a biased estimate of the variance of the unobserved errors, the bias is removed by dividing the sum of the squared residuals by df = n − p − 1 instead of n, where df is the number of degrees of freedom: n minus the number p of parameters being estimated (excluding the intercept), minus 1 for the intercept. This forms an unbiased estimate of the variance of the unobserved errors, and is called the mean squared error.[7]
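The division by n − p − 1 can be sketched with synthetic data (the true error standard deviation is set to 1, so the unbiased MSE should estimate a variance near 1); a simple straight-line fit via `numpy.polyfit` is assumed:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 30)
y = 1.5 * x + 2.0 + rng.normal(0, 1.0, size=x.size)  # true error variance = 1

slope, intercept = np.polyfit(x, y, 1)   # p = 1 parameter besides the intercept
residuals = y - (slope * x + intercept)

n, p = x.size, 1
sse = (residuals ** 2).sum()
mean_sq_resid = sse / n                  # mean of squared residuals (biased downward)
mse = sse / (n - p - 1)                  # unbiased estimate of the error variance
print(mean_sq_resid, mse)
```

Fitting the line consumes two degrees of freedom, which is why the denominator shrinks from n to n − p − 1.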
The mean square of the error can also be calculated when analyzing the variance of a linear regression with a technique like that used in ANOVA (they are the same because ANOVA is a type of regression): the sum of squares of the residuals (also called the sum of squares of the error) is divided by its degrees of freedom, n − p − 1. The mean square of the model, in turn, is the model sum of squares divided by its degrees of freedom, the number of estimated parameters. The F-value, which is the mean square of the model divided by the mean square of the error, can then be used to assess significance.
However, because of the behavior of the process of regression, the distributions of residuals at different data points (of the input variable) may vary even if the errors themselves are identically distributed. Concretely, in a linear regression where the errors are identically distributed, the variability of residuals of inputs in the middle of the domain will be higher than the variability of residuals at the ends of the domain: linear regressions fit endpoints better than the middle. This is also reflected in the influence functions of various data points on the regression coefficients: endpoints have more influence.
Thus to compare residuals at different inputs, one needs to adjust the residuals by the expected variability of residuals, which is called studentizing. This is particularly important in the case of detecting outliers, where the case in question is somehow different from the others in a dataset. For example, a large residual may be expected in the middle of the domain, but considered an outlier at the end of the domain.
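This adjustment divides each residual by its estimated standard deviation, which depends on the leverage of the corresponding data point. A sketch with synthetic data, assuming a simple straight-line model, shows that leverage is highest at the endpoints (so raw residuals there have the smallest expected variability):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 10, 25)
X = np.column_stack([np.ones_like(x), x])          # design matrix with an intercept column
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, size=x.size)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)       # least-squares fit
residuals = y - X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T               # hat matrix
h = np.diag(H)                                     # leverages; their sum equals p = 2
s2 = (residuals ** 2).sum() / (x.size - X.shape[1])
studentized = residuals / np.sqrt(s2 * (1 - h))    # internally studentized residuals

print(h[0], h[x.size // 2], h[-1])                 # endpoint leverages exceed the middle one
```

Dividing by the leverage-dependent factor puts residuals at the ends and the middle of the domain on a comparable scale, which is what outlier tests require.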
Other uses of the word "error" in statistics
The use of the term "error" as discussed in the sections above is in the sense of a deviation of a value from a hypothetical unobserved value. At least two other uses also occur in statistics, both referring to observable prediction errors:
The mean squared error (MSE) refers to the amount by which the values predicted by an estimator differ from the quantities being estimated (typically outside the sample from which the model was estimated). The root mean square error (RMSE) is the square root of the MSE.
Likewise, the sum of absolute errors (SAE) is the sum of the absolute values of the residuals, which is minimized in the least absolute deviations approach to regression.
The mean error (ME) is the bias. The mean residual (MR) is always zero for least-squares estimators.
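These summary measures can be computed in a few lines. The sketch below uses a made-up dataset and an in-sample least-squares fit, which is why the mean residual comes out as zero by construction:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical data
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

slope, intercept = np.polyfit(x, y, 1)     # least-squares straight line
residuals = y - (slope * x + intercept)

mr = residuals.mean()                      # mean residual: 0 for least squares with intercept
sse = (residuals ** 2).sum()               # sum of squared errors
mse = (residuals ** 2).mean()              # mean squared error (here: of residuals, in-sample)
rmse = np.sqrt(mse)                        # root mean square error
sae = np.abs(residuals).sum()              # sum of absolute errors
print(mr, sse, mse, rmse, sae)
```

On a held-out sample the deviations would be genuine prediction errors, and their mean (the mean error) would estimate the bias of the predictor rather than being identically zero.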
See also
- Absolute deviation
- Consensus forecasts
- Error detection and correction
- Explained sum of squares
- Innovation (signal processing)
- Lack-of-fit sum of squares
- Margin of error
- Mean absolute error
- Observational error
- Propagation of error
- Probable error
- Random and systematic errors
- Reduced chi-squared statistic
- Regression dilution
- Root mean square deviation
- Sampling error
- Standard error
- Studentized residual
- Type I and type II errors
References
- ISBN 978-1-4051-8257-7. Retrieved 2022-05-13.
- ISBN 978-1-337-67133-0. Retrieved 2022-05-13.
- ISBN 978-981-329-019-8. Retrieved 2022-05-13.
- OCLC 7779780.
- ^ OCLC 262680588.
- OCLC 987251007.
- ^ Steel, Robert G. D.; Torrie, James H. (1960). Principles and Procedures of Statistics, with Special Reference to Biological Sciences. McGraw-Hill. p. 288.
- ISBN 9780521761598.
- ^ "7.3: Types of Outliers in Linear Regression". Statistics LibreTexts. 2013-11-21. Retrieved 2019-11-22.
Further reading
- Cook, R. Dennis; Weisberg, Sanford (1982). Residuals and Influence in Regression (Repr. ed.). New York: ISBN 041224280X. Retrieved 23 February 2013.
- JSTOR 2984505.
- Weisberg, Sanford (1985). Applied Linear Regression (2nd ed.). New York: Wiley. ISBN 9780471879572. Retrieved 23 February 2013.
- "Errors, theory of", Encyclopedia of Mathematics, EMS Press, 2001 [1994]
External links
- Media related to Errors and residuals at Wikimedia Commons