Binomial regression

In statistics, binomial regression is a regression analysis technique in which the response (often referred to as Y) has a binomial distribution: it is the number of successes in a series of n independent Bernoulli trials, where each trial has probability of success p. In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.
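Binomial regression is closely related to binary regression: a binary regression can be considered a binomial regression with $n=1$, or a regression on ungrouped binary data, while a binomial regression can be considered a regression on grouped binary data (see Comparison with binary regression). Binomial regression models are essentially the same as binary choice models, one type of discrete choice model: the primary difference is in the theoretical motivation (see Comparison with binary choice models). In machine learning, binomial regression is considered a special case of probabilistic classification, and thus a generalization of binary classification.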

Example application
In one published example of an application of binomial regression,[3] the details were as follows. The observed outcome variable was whether or not a fault occurred in an industrial process. There were two explanatory variables: the first was a simple two-case factor representing whether or not a modified version of the process was used and the second was an ordinary quantitative variable measuring the purity of the material being supplied for the process.

Specification of model
The response variable Y is assumed to be binomially distributed conditional on the explanatory variables X. The number of trials n is known, and the probability of success for each trial p is specified as a function θ(X). This implies that the conditional expectation and conditional variance of the observed fraction of successes, Y/n, are

$E(Y/n\mid X)=\theta (X)$
$\operatorname {Var} (Y/n\mid X)=\theta (X)(1-\theta (X))/n$
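The two moments above can be checked numerically. The following is a minimal sketch, assuming hypothetical values n = 20 and θ(X) = 0.3 for a single fixed X; the helper names are illustrative and not taken from any library. It sums over the binomial probability mass function to obtain the exact mean and variance of Y/n.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def fraction_moments(n, p):
    """Exact mean and variance of the observed fraction Y/n."""
    mean = sum((k / n) * binomial_pmf(k, n, p) for k in range(n + 1))
    var = sum((k / n - mean) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
    return mean, var


# Hypothetical values chosen only for illustration.
n, theta = 20, 0.3
mean, var = fraction_moments(n, theta)

print(mean, theta)                    # mean of Y/n next to theta
print(var, theta * (1 - theta) / n)   # variance next to theta(1 - theta)/n
```

Both printed pairs should agree up to floating-point rounding, which illustrates why the variance of the observed fraction shrinks as n grows.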
The goal of binomial regression is to estimate the function θ(X). Typically the statistician assumes $\theta (X)=m(\beta ^{\mathrm {T} }X)$ , for a known function m, and estimates β. Common choices for m include the logistic function.^[1]
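The data are often fitted as a generalised linear model in which the predicted values μ are the probabilities that any individual event will result in a success. The likelihood of the predictions is then given by

$L({\boldsymbol {\mu }}\mid Y)=\prod _{i=1}^{n}\left(1_{y_{i}=1}(\mu _{i})+1_{y_{i}=0}(1-\mu _{i})\right),$
where 1_A is the indicator function, which takes the value one when the event A occurs and zero otherwise. Here the product runs over the n individual binary observations y_i, and for any given observation only one of the two terms inside the product contributes, according to whether y_i = 0 or 1. The likelihood function is more fully specified by defining the formal parameters μ_i as parameterised functions of the explanatory variables: this defines the likelihood in terms of a much reduced number of parameters. Fitting of the model is usually achieved by employing the method of maximum likelihood to determine these parameters. In practice, formulating the problem as a generalised linear model makes available certain algorithms that apply across the whole class of such models but not to maximum likelihood problems in general.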
Models used in binomial regression can often be extended to multinomial data.
There are many methods of generating the values of μ in systematic ways that allow for interpretation of the model; they are discussed below.
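As an illustration of maximum likelihood fitting, the sketch below assumes a single explanatory variable, the logistic function for m, and a small made-up data set; none of these come from the cited sources. It maximises the log of the likelihood above with Newton-Raphson steps, which is one of several possible algorithms.

```python
import math


def logistic(eta):
    """Logistic function, one common choice for m."""
    return 1.0 / (1.0 + math.exp(-eta))


def log_likelihood(mu, y):
    """Log of the product likelihood: each y_i selects mu_i or 1 - mu_i."""
    return sum(math.log(m if yi == 1 else 1.0 - m) for m, yi in zip(mu, y))


def fit_logistic(x, y, iterations=25):
    """Newton-Raphson for mu_i = logistic(b0 + b1 * x_i)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        mu = [logistic(b0 + b1 * xi) for xi in x]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        w = [mi * (1.0 - mi) for mi in mu]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 linear system for the Newton step and apply it.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical data: the outcomes are not perfectly separated by x,
# so the maximum likelihood estimate is finite.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(x, y)
fitted = [logistic(b0 + b1 * xi) for xi in x]
print(b0, b1, log_likelihood(fitted, y))
```

The fixed iteration count keeps the sketch short; a practical implementation would stop on a convergence criterion and guard against separable data, where the estimates diverge.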

Link functions
The model linking the probabilities μ to the explanatory variables must be of a form that produces only values in the range 0 to 1. Many models can be fitted into the form

${\boldsymbol {\mu }}=g({\boldsymbol {\eta }})\,.$
Here η is an intermediate variable representing a linear combination, containing the regression parameters, of the explanatory variables. The function g is the cumulative distribution function (cdf) of some probability distribution. Usually this probability distribution has support from minus infinity to plus infinity, so that any finite value of η is transformed by g to a value inside the range 0 to 1.
In the case of logistic regression, g is the logistic function, whose inverse (the link function in the strict sense) is the logit, the log of the odds. In the case of probit, g is the cdf of the normal distribution. The linear probability model is not a proper binomial regression specification because its predictions need not lie in the range of zero to one; it is sometimes used for this type of data when interpretation is to be carried out on the probability scale, or when the analyst is unable to fit or calculate approximate linearizations of probabilities for interpretation.
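The contrast between these choices of g can be sketched in a few lines. The function names below are illustrative assumptions; the normal cdf is written with the standard library's error function.

```python
import math


def logistic_cdf(eta):
    """g for logistic regression: cdf of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-eta))


def normal_cdf(eta):
    """g for probit: cdf of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def linear_probability(eta):
    """Linear probability model: eta is used directly as the probability."""
    return eta


for eta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(eta, logistic_cdf(eta), normal_cdf(eta), linear_probability(eta))
```

For these values of η the first two columns of output stay strictly between 0 and 1, while the last column leaves that range as soon as η does.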

Comparison with binary regression
Binomial regression is closely connected with binary regression. If the response is a binary variable (two possible outcomes), then these alternatives can be coded as 0 or 1 by considering one of the outcomes as "success" and the other as "failure" and considering these as count data: "success" is 1 success out of 1 trial, while "failure" is 0 successes out of 1 trial. This can now be considered a binomial distribution with $n=1$ trial, so a binary regression is a special case of a binomial regression. If these data are grouped (by adding counts), they are no longer binary data, but are count data for each group, and can still be modeled by a binomial regression; the individual binary outcomes are then referred to as "ungrouped data". An advantage of working with grouped data is that one can test the goodness of fit of the model;^[2] for example, grouped data may exhibit overdispersion relative to the variance estimated from the ungrouped data.
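One way to see why grouping does not change the fitted model is to compare the two log-likelihoods directly. The sketch below assumes a single hypothetical group of ten binary outcomes sharing one success probability p; the grouped (binomial) and ungrouped (Bernoulli) log-likelihoods then differ only by the log of a binomial coefficient, which does not depend on p.

```python
from math import comb, log


def ungrouped_loglik(p, outcomes):
    """Bernoulli (n = 1) log-likelihood of the individual 0/1 outcomes."""
    return sum(log(p) if y == 1 else log(1.0 - p) for y in outcomes)


def grouped_loglik(p, successes, trials):
    """Binomial log-likelihood of the same data after adding counts."""
    return (log(comb(trials, successes))
            + successes * log(p)
            + (trials - successes) * log(1.0 - p))


# Hypothetical outcomes for one group: 6 successes in 10 trials.
outcomes = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]
successes, trials = sum(outcomes), len(outcomes)

for p in (0.3, 0.6, 0.8):
    gap = grouped_loglik(p, successes, trials) - ungrouped_loglik(p, outcomes)
    print(p, gap)  # the same constant, log C(10, 6), for every p
```

Because the gap is constant in p, both forms of the data are maximised by the same parameter values; what grouping adds is the ability to compare observed and fitted counts within each group.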

Comparison with binary choice models
A binary choice model assumes a latent variable U_n, the utility (or net benefit) that person n obtains from taking an action (as opposed to not taking the action). The utility the person obtains from taking the action depends on the characteristics of the person, some of which are observed by the researcher and some are not:

$U_{n}={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} +\varepsilon _{n}$
where ${\boldsymbol {\beta }}$ is a set of regression coefficients and $\mathbf {s_{n}}$ is a set of independent variables (also known as "features") describing person n, which may be either discrete "dummy variables" or regular continuous variables. $\varepsilon _{n}$ is a random variable specifying "noise" or "error" in the prediction, assumed to be distributed according to some distribution. Normally, if there is a mean or variance parameter in the distribution, it cannot be identified, so the parameters are set to convenient values: by convention usually mean 0, variance 1.
The person takes the action, y_n = 1, if U_n > 0. The unobserved term, ε_n, is commonly assumed to have a logistic distribution, although other distributions such as the standard normal are also used.
The specification is written succinctly as:

$U_{n}={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} +\varepsilon _{n}$
$Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}$
$\varepsilon \sim$ logistic, standard normal, etc.
Let us write it slightly differently:

$U_{n}={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} -e_{n}$
$Y_{n}={\begin{cases}1,&{\text{if }}U_{n}>0,\\0,&{\text{if }}U_{n}\leq 0\end{cases}}$
$e\sim$ logistic, standard normal, etc.
Here we have made the substitution e_n = −ε_n. This changes a random variable into a slightly different one, defined over a negated domain. As it happens, the error distributions we usually consider (e.g. logistic distribution, standard normal distribution, standard Student's t-distribution, etc.) are symmetric about 0, and hence the distribution over e_n is identical to the distribution over ε_n.
Denote the cumulative distribution function (CDF) of $e$ as $F_{e},$ and the quantile function (inverse CDF) of $e$ as $F_{e}^{-1}.$
Note that

${\begin{aligned}\Pr(Y_{n}=1)&=\Pr(U_{n}>0)\\[6pt]&=\Pr({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} -e_{n}>0)\\[6pt]&=\Pr(-e_{n}>-{\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=\Pr(e_{n}\leq {\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\\[6pt]&=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )\end{aligned}}$
Since $Y_{n}$ is a Bernoulli trial, where $\mathbb {E} [Y_{n}]=\Pr(Y_{n}=1),$ we have

$\mathbb {E} [Y_{n}]=F_{e}({\boldsymbol {\beta }}\cdot \mathbf {s_{n}} )$
or equivalently

$F_{e}^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}} .$
Note that this is exactly equivalent to the binomial regression model expressed in the formalism of the generalized linear model.
If $e_{n}\sim {\mathcal {N}}(0,1),$ i.e. distributed as a standard normal distribution, then

$\Phi ^{-1}(\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}}$
which is exactly a probit model.
If $e_{n}\sim \operatorname {Logistic} (0,1),$ i.e. distributed as a standard logistic distribution with mean 0 and scale parameter 1, then the corresponding quantile function is the logit function, and

$\operatorname {logit} (\mathbb {E} [Y_{n}])={\boldsymbol {\beta }}\cdot \mathbf {s_{n}}$
which is exactly a logit model.
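The equivalence derived above can be checked by simulation. The following sketch assumes standard logistic noise and a few hypothetical values of the linear predictor β · s_n; it draws the latent utility many times, records how often U_n > 0, and compares that frequency with the logistic cdf evaluated at β · s_n.

```python
import math
import random


def logistic_cdf(x):
    """Cdf of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-x))


def draw_standard_logistic(rng):
    """Inverse-cdf sampling: the logit of a Uniform(0, 1) draw."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, which would give log(0)
        u = rng.random()
    return math.log(u / (1.0 - u))


def simulate_choice_rate(beta_dot_s, draws, rng):
    """Fraction of simulated people whose utility U_n exceeds zero."""
    taken = 0
    for _ in range(draws):
        utility = beta_dot_s + draw_standard_logistic(rng)
        if utility > 0:
            taken += 1
    return taken / draws


rng = random.Random(0)  # fixed seed so the run is repeatable
for beta_dot_s in (-1.0, 0.0, 1.5):  # hypothetical linear predictor values
    rate = simulate_choice_rate(beta_dot_s, 200_000, rng)
    print(beta_dot_s, rate, logistic_cdf(beta_dot_s))
```

The simulated rate and the cdf value should agree to about two decimal places at this sample size; replacing the noise generator with `rng.gauss(0.0, 1.0)` and the cdf with the normal cdf would give the corresponding check for the probit case.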
Note that the two different formalisms, generalized linear models (GLMs) and discrete choice models, are equivalent in the case of simple binary choice models, but can be extended in differing ways:

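GLMs can easily handle response variables with a wide range of distributions, not just the categorical or ordinal variables to which discrete choice models are limited by their nature. GLMs are also not limited to link functions that are quantile functions of some distribution, unlike approaches based on an error variable, which must by assumption have a probability distribution.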

On the other hand, because discrete choice models are described as types of generative models, it is conceptually easier to extend them to complicated situations with multiple, possibly correlated, choices for each person, or other variations.

Latent variable interpretation / derivation
A latent variable model involving a binomial observed variable Y can be constructed such that Y is related to the latent variable Y* via

$Y={\begin{cases}1,&{\mbox{if }}Y^{*}>0\\0,&{\mbox{if }}Y^{*}\leq 0.\end{cases}}$
The latent variable Y* is then related to a set of regression variables X by the model

$Y^{*}=X\beta +\epsilon \ .$
This results in a binomial regression model.
The variance of ϵ cannot be identified and, when it is not of interest, is often assumed to be equal to one. If ϵ is normally distributed, then a probit is the appropriate model, and if ϵ follows a logistic distribution (which arises, for example, as the difference of two independent log-Weibull variables), then a logit is appropriate. If ϵ is uniformly distributed, then a linear probability model is appropriate.
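The lack of identification of the error variance can be illustrated with a small simulation. The sketch below assumes a single scalar regressor and hypothetical values β = 0.75 with unit error scale; doubling both β and the scale of ϵ doubles every Y*, so each Y* stays on the same side of zero and the observed Y is unchanged under either coding of Y.

```python
import random


def latent_draws(beta, scale, x_values, noise):
    """Y* = x * beta + scale * noise for each observation."""
    return [x * beta + scale * e for x, e in zip(x_values, noise)]


rng = random.Random(1)  # fixed seed so the run is repeatable
x_values = [rng.uniform(-2.0, 2.0) for _ in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Same regressors and same noise draws, with beta and the error scale doubled.
base = latent_draws(beta=0.75, scale=1.0, x_values=x_values, noise=noise)
doubled = latent_draws(beta=1.5, scale=2.0, x_values=x_values, noise=noise)

# The observed outcome depends only on which side of zero Y* falls.
same_side = all((a > 0) == (b > 0) for a, b in zip(base, doubled))
print(same_side)
```

Since the data cannot distinguish the two parameterisations, fixing the variance of ϵ at one is a normalisation rather than a substantive assumption.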

See also
Linear probability model
Poisson regression
Predictive modelling

Notes
[1] Weisberg. ISBN 0-471-66379-4.
[2] Rodríguez 2007, Chapter 3, p. 5.
[3] Cox & Snell (1981), Example H, p. 91.

References
Cox & Snell (1981). ISBN 0-412-16570-8.
Rodríguez, Germán (2007). "Lecture Notes on Generalized Linear Models".

Further reading
Dean, C. B. (1992). "Testing for Overdispersion in Poisson and Binomial Regression Models". Journal of the American Statistical Association. 87 (418). Informa UK Limited: 451–457. JSTOR 2290276.
