Binomial regression
In statistics, binomial regression is a regression analysis technique in which the response Y has a binomial distribution: it is the number of successes in a series of n independent Bernoulli trials, each with success probability p. In binomial regression, this success probability is related to explanatory variables, in the same way that ordinary regression relates the mean of the response to explanatory variables.
Binomial regression is closely related to binary regression: a binary regression can be considered a binomial regression with n = 1, or a regression on ungrouped (individual-level) data, while a binomial regression can be considered a regression on data grouped into counts of successes out of n trials.
Example application
In one published example of an application of binomial regression,[3] the details were as follows. The observed outcome variable was whether or not a fault occurred in an industrial process. There were two explanatory variables: the first was a simple two-case factor representing whether or not a modified version of the process was used and the second was an ordinary quantitative variable measuring the purity of the material being supplied for the process.
Specification of model
The response variable Y is assumed to be binomially distributed conditional on the explanatory variables X. The number of trials n is known, and the probability of success for each trial p is specified as a function θ(X). This implies that the conditional expectation and conditional variance of the observed fraction of successes, Y/n, are

$$\operatorname{E}[Y/n \mid X] = \theta(X), \qquad \operatorname{Var}(Y/n \mid X) = \frac{\theta(X)\,(1 - \theta(X))}{n}.$$
The goal of binomial regression is to estimate the function θ(X). Typically the statistician assumes $\theta(X) = m(\beta^{\mathsf{T}} X)$, for a known function m, and estimates β. Common choices for m include the logistic function.[1]
The data are often fitted as a generalised linear model where the predicted values μ are the probabilities that any individual event will result in a success. The likelihood of the predictions is then given by

$$L(\boldsymbol{\mu} \mid Y) = \prod_{i=1}^{n} \left( 1_{\{y_i = 1\}}\, \mu_i + 1_{\{y_i = 0\}}\, (1 - \mu_i) \right),$$

where 1A is the indicator function, which takes the value one when the event A occurs and zero otherwise; for each observation, exactly one of the two terms in the product contributes, according to whether yi is 1 or 0. The likelihood is completed by specifying each μi as a function of the explanatory variables, and the parameters are then usually estimated by maximum likelihood.
Models used in binomial regression can often be extended to multinomial data.
There are many methods of generating the values of μ in systematic ways that allow for interpretation of the model; they are discussed below.
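As a concrete illustration of this fitting process, the following is a minimal sketch in Python using the statsmodels package; the data are synthetic and the variable names (purity, modified, n_trials) are invented here, loosely echoing the fault/purity example above, not taken from that study.

```python
# Minimal sketch: fitting a binomial GLM with a logit link to synthetic grouped data.
# All numbers and variable names are invented for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

n_obs = 50
n_trials = rng.integers(10, 30, size=n_obs)   # known number of trials per group
purity = rng.uniform(0.0, 1.0, size=n_obs)    # quantitative explanatory variable
modified = rng.integers(0, 2, size=n_obs)     # two-level factor (0 = original, 1 = modified process)

# Simulate grouped binomial responses with theta(X) = logistic(beta' X)
eta = -1.0 + 2.0 * purity - 0.5 * modified
theta = 1.0 / (1.0 + np.exp(-eta))
successes = rng.binomial(n_trials, theta)

# statsmodels accepts a grouped binomial response as [successes, failures]
endog = np.column_stack([successes, n_trials - successes])
exog = sm.add_constant(np.column_stack([purity, modified]))

model = sm.GLM(endog, exog, family=sm.families.Binomial())  # default link: logit
result = model.fit()                                        # maximum likelihood fit
print(result.summary())
```

The fitted coefficients are estimates of β on the link (log-odds) scale; applying the logistic function to the linear predictor recovers the estimated success probabilities μ.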
Link functions
The model linking the probabilities μ to the explanatory variables must be of a form that only produces values in the range 0 to 1. Many models can be fitted into the form

$$\mu = g(\eta)\,.$$
Here η is an intermediate variable, a linear combination of the explanatory variables with the regression parameters as coefficients. The function g is the cumulative distribution function (cdf) of some probability distribution. Usually this probability distribution has support from minus infinity to plus infinity, so that any finite value of η is transformed by g to a value inside the range 0 to 1.
In the case of logistic regression, g is the logistic function, corresponding to a logit (log-odds) link; in the case of probit regression, g is the cdf of the standard normal distribution. The linear probability model is not a proper binomial regression specification because predictions need not lie in the range of zero to one; it is nevertheless sometimes used for this type of data when interpretation is to be carried out directly on the probability scale, or when the analyst is unable to fit the proper model or to calculate approximate linearizations of probabilities for interpretation.
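To make the role of g concrete, here is a small sketch (assuming Python with SciPy; the η values are arbitrary) comparing the logistic and standard normal cdfs with the identity mapping of the linear probability model:

```python
# Sketch: common choices of g map any finite eta into (0, 1); the identity does not.
import numpy as np
from scipy.stats import logistic, norm

eta = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])   # arbitrary linear-predictor values

mu_logistic = logistic.cdf(eta)   # logistic regression: g is the logistic cdf
mu_probit = norm.cdf(eta)         # probit regression: g is the standard normal cdf
mu_linear = eta                   # linear probability model: identity, not confined to (0, 1)

for e, a, b, c in zip(eta, mu_logistic, mu_probit, mu_linear):
    print(f"eta = {e:+.1f}   logistic: {a:.3f}   probit: {b:.3f}   linear: {c:+.1f}")
```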
Comparison with binary regression
Binomial regression is closely connected with binary regression. If the response is a binary variable (two possible outcomes), these outcomes can be coded as 0 and 1 by treating one of them as "success" and the other as "failure", and regarding the response as a count of successes out of a single trial. Any binary regression can therefore be viewed as a binomial regression with n = 1, while binomial data can be viewed as grouped binary data, and the same link functions apply in both cases.
Comparison with binary choice models
A binary choice model assumes a latent variable Un, the utility (or net benefit) that person n obtains from taking an action (as opposed to not taking the action). The utility the person obtains from taking the action depends on the characteristics of the person, some of which are observed by the researcher and some are not:

$$U_n = \boldsymbol\beta \cdot \mathbf{s}_n + \varepsilon_n,$$

where $\boldsymbol\beta$ is a set of regression coefficients and $\mathbf{s}_n$ is a set of independent variables (also known as "features") describing person n, which may be either discrete "dummy variables" or regular continuous variables. $\varepsilon_n$ is a random variable specifying "noise" or "error" in the prediction, assumed to be distributed according to some distribution.
The person takes the action, yn = 1, if Un > 0. The unobserved term, εn, is assumed to have a logistic distribution.
The specification is written succinctly as:

$$Y_n = \begin{cases} 1, & \text{if } U_n > 0, \\ 0, & \text{if } U_n \le 0, \end{cases} \qquad U_n = \boldsymbol\beta \cdot \mathbf{s}_n + \varepsilon_n, \qquad \varepsilon_n \sim \text{logistic}.$$
Let us write it slightly differently:

$$Y_n = \begin{cases} 1, & \text{if } U_n > 0, \\ 0, & \text{if } U_n \le 0, \end{cases} \qquad U_n = \boldsymbol\beta \cdot \mathbf{s}_n - e_n.$$
Here we have made the substitution en = −εn. This changes a random variable into a slightly different one, defined over a negated domain. As it happens, the error distributions we usually consider (e.g. logistic distribution, standard normal distribution, standard Student's t-distribution, etc.) are symmetric about 0, and hence the distribution over en is identical to the distribution over εn.
Denote the cumulative distribution function (CDF) of $e$ as $F_e$ and the quantile function (inverse CDF) of $e$ as $F_e^{-1}$.
Note that

$$\Pr(Y_n = 1) = \Pr(U_n > 0) = \Pr(\boldsymbol\beta \cdot \mathbf{s}_n - e_n > 0) = \Pr(e_n < \boldsymbol\beta \cdot \mathbf{s}_n) = F_e(\boldsymbol\beta \cdot \mathbf{s}_n).$$
Since $Y_n$ is a Bernoulli trial, where $\operatorname{E}[Y_n] = \Pr(Y_n = 1)$, we have

$$\operatorname{E}[Y_n] = F_e(\boldsymbol\beta \cdot \mathbf{s}_n),$$

or equivalently

$$F_e^{-1}(\operatorname{E}[Y_n]) = \boldsymbol\beta \cdot \mathbf{s}_n.$$
Note that this is exactly equivalent to the binomial regression model expressed in the formalism of the generalized linear model.
If $e_n \sim \mathcal{N}(0, 1)$, i.e. distributed as a standard normal distribution, then

$$\Phi^{-1}(\operatorname{E}[Y_n]) = \boldsymbol\beta \cdot \mathbf{s}_n,$$
which is exactly a probit model.
If $e_n$ is distributed as a standard logistic distribution (mean 0, scale parameter 1), then the corresponding quantile function is the logit function, and

$$\operatorname{logit}(\operatorname{E}[Y_n]) = \boldsymbol\beta \cdot \mathbf{s}_n,$$

which is exactly a logit model.
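The equivalence derived above can be checked numerically. The following rough sketch (synthetic data; the coefficient values are invented) simulates the latent-utility model with logistic noise and recovers β by fitting a logit model; replacing the logistic draws with standard normal ones and fitting a probit would illustrate the other case.

```python
# Sketch: simulate U_n = beta . s_n + eps_n with logistic eps_n, observe Y_n = 1[U_n > 0],
# then check that a fitted logit model recovers beta. Coefficients are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
beta_true = np.array([0.5, 1.5, -2.0])        # intercept and two slopes (illustrative)

n = 20_000
s = sm.add_constant(rng.normal(size=(n, 2)))  # features s_n plus an intercept column
eps = rng.logistic(size=n)                    # standard logistic noise
utility = s @ beta_true + eps                 # latent utility U_n
y = (utility > 0).astype(int)                 # observed binary choice Y_n

fit = sm.Logit(y, s).fit(disp=0)              # logit model for the observed choices
print("true beta:     ", beta_true)
print("estimated beta:", np.round(fit.params, 2))
```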
Note that the two different formalisms, generalized linear models (GLM's) and discrete choice models, are equivalent in the case of simple binary choice models, but can be extended in differing ways:
- GLM's can handle response variables with a variety of distributions, and their link functions are not restricted to being quantile functions of some distribution, unlike the error variable of a discrete choice model, which must by assumption have a probability distribution.
- On the other hand, because discrete choice models are described as types of generative models, it is conceptually easier to extend them to complicated situations with multiple, possibly correlated, choices for each person, or other variations.
Latent variable interpretation / derivation
A latent variable model involving a binomial observed variable Y can be constructed such that Y is related to the latent variable Y* via

$$Y = \begin{cases} 0, & \text{if } Y^* \le 0, \\ 1, & \text{if } Y^* > 0. \end{cases}$$
The latent variable Y* is then related to a set of regression variables X by the model

$$Y^* = X\beta + \epsilon\,.$$
This results in a binomial regression model.
The variance of ϵ cannot be identified, and when it is not of interest it is often assumed to be equal to one. If ϵ is normally distributed, then a probit model is appropriate; if ϵ is log-Weibull distributed, then a logit is appropriate; and if ϵ is uniformly distributed, then a linear probability model is appropriate.
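The non-identifiability of the variance can be seen in a short numerical sketch (illustrative values only): scaling both β and the standard deviation of a normal ϵ by the same factor leaves the implied choice probabilities unchanged, which is why the variance is conventionally fixed at one.

```python
# Sketch: in Y* = X beta + eps, Y = 1[Y* > 0] with eps ~ N(0, sigma^2),
# Pr(Y = 1 | X) = Phi(X beta / sigma), so scaling beta and sigma together changes nothing.
import numpy as np
from scipy.stats import norm

beta = np.array([0.8, -1.2])        # illustrative coefficients
x = np.array([[0.3, 1.0],
              [1.5, -0.4]])          # two illustrative covariate rows

for c in (1.0, 3.0):
    p = norm.cdf(x @ (c * beta) / c)   # beta -> c*beta, sigma -> c
    print(f"scale c = {c}: Pr(Y = 1 | X) = {np.round(p, 4)}")
```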
Notes
- ^ Weisberg, Sanford (2005). Applied Linear Regression (3rd ed.). Wiley. ISBN 0-471-66379-4.
- ^ a b Rodríguez 2007, Chapter 3, p. 5.
- ^ Cox & Snell (1981), Example H, p. 91
References
- Cox, D. R.; Snell, E. J. (1981). Applied Statistics: Principles and Examples. Chapman and Hall. ISBN 0-412-16570-8.
- Rodríguez, Germán (2007). "Lecture Notes on Generalized Linear Models".
Further reading
- Dean, C. B. (1992). "Testing for Overdispersion in Poisson and Binomial Regression Models". Journal of the American Statistical Association. 87 (418): 451–457. JSTOR 2290276.