Biostatistics

Biostatistics (also known as biometry) is a branch of statistics that applies statistical methods to a wide range of topics in biology. It encompasses the design of biological experiments, the collection and analysis of data from those experiments and the interpretation of the results.

History

Biostatistics and genetics

Biostatistical modeling forms an important part of numerous modern biological theories. From its beginning, genetics has used statistical concepts to understand observed experimental results, and some geneticists even contributed statistical advances by developing methods and tools. Gregor Mendel began the study of genetics by investigating segregation patterns in families of peas and used statistics to explain the collected data. In the early 1900s, after the rediscovery of Mendel's work on inheritance, there were gaps in understanding between genetics and evolutionary Darwinism. Francis Galton tried to expand Mendel's discoveries with human data and proposed a different model in which fractions of the heredity come from each ancestor, composing an infinite series. He called this the "Law of Ancestral Heredity". His ideas were strongly opposed by William Bateson, who followed Mendel's conclusion that genetic inheritance comes exclusively from the parents, half from each of them. This led to a vigorous debate between the biometricians, who supported Galton's ideas, such as Raphael Weldon, Arthur Dukinfield Darbishire and Karl Pearson, and the Mendelians, who supported Bateson's (and Mendel's) ideas, such as Charles Davenport and Wilhelm Johannsen. Later, the biometricians could not reproduce Galton's conclusions in different experiments, and Mendel's ideas prevailed. By the 1930s, models built on statistical reasoning had helped to resolve these differences and to produce the neo-Darwinian modern evolutionary synthesis.

Resolving these differences also made it possible to define the concept of population genetics and brought together genetics and evolution. The three leading figures in the establishment of population genetics and this synthesis all relied on statistics and developed its use in biology.

Ronald Fisher developed several basic statistical methods, including ANOVA, p-value concepts, Fisher's exact test and Fisher's equation for population dynamics. He is credited with the sentence "Natural selection is a mechanism for generating an exceedingly high degree of improbability".^[2]

Sewall Wright defined the inbreeding coefficient.

J. B. S. Haldane's book, The Causes of Evolution, reestablished natural selection as the premier mechanism of evolution by explaining it in terms of the mathematical consequences of Mendelian genetics. He also developed the theory of primordial soup.

These and other biostatisticians, mathematical biologists, and statistically inclined geneticists helped bring together evolutionary biology and genetics into a consistent, coherent whole that could begin to be quantitatively modeled.

In parallel to this overall development, the pioneering work of D'Arcy Thompson in On Growth and Form also helped to add quantitative discipline to biological study.

Despite the fundamental importance and frequent necessity of statistical reasoning, there may nonetheless have been a tendency among biologists to distrust or deprecate results which are not qualitatively apparent. One anecdote describes Thomas Hunt Morgan banning the Friden calculator from his department at Caltech, saying "Well, I am like a guy who is prospecting for gold along the banks of the Sacramento River in 1849. With a little intelligence, I can reach down and pick up big nuggets of gold. And as long as I can do that, I'm not going to let any people in my department waste scarce resources in placer mining."^[3]

Research planning

Any research in the life sciences should be planned in terms of its research question, the hypothesis to be tested, the experimental design, data collection methods, data analysis perspectives and costs involved. It is essential to carry out the study based on the three basic principles of experimental statistics: randomization, replication, and local control.
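As a rough illustration of how these three principles fit together, the sketch below lays out a randomized complete block design in Python. The function name `randomize_blocks`, the treatment labels and the seed are hypothetical choices made for this example, not part of any cited method. Each block receives every treatment once (replication across blocks), the order within a block is shuffled (randomization), and grouping units into blocks stands in for local control.

```python
import random


def randomize_blocks(treatments, n_blocks, seed=None):
    """Return one shuffled copy of the treatment list per block."""
    rng = random.Random(seed)  # a fixed seed makes the layout reproducible
    layout = []
    for _ in range(n_blocks):
        order = list(treatments)  # every treatment appears once per block
        rng.shuffle(order)        # random plot order inside the block
        layout.append(order)
    return layout


# Hypothetical experiment: three treatments replicated in four blocks.
layout = randomize_blocks(["A", "B", "C"], n_blocks=4, seed=1)
for block_number, order in enumerate(layout, start=1):
    print(f"block {block_number}: {order}")
```

Shuffling independently inside each block keeps the comparison between treatments within relatively homogeneous units, which is the purpose of local control.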

Research question

The research question defines the objective of a study. Because the research is guided by this question, it needs to be concise while focusing on interesting and novel topics that may advance science and knowledge in that field. An exhaustive literature review might be necessary to decide how to frame the scientific question, so that the research adds value for the scientific community.^[4]

Hypothesis definition

Once the aim of the study is defined, the possible answers to the research question can be proposed, transforming this question into a hypothesis. The main proposition is called the null hypothesis (H₀) and is usually based on established knowledge about the topic or an obvious occurrence of the phenomenon, supported by a deep literature review. It can be regarded as the standard expected answer for the data under the situation being tested. In general, H₀ assumes no association between treatments. On the other hand, the alternative hypothesis is the denial of H₀: it assumes some degree of association between the treatment and the outcome. In both cases, the hypothesis is driven by the research question and its expected and unexpected answers.^[4]

As an example, consider groups of similar animals (mice, for example) under two different diet systems. The research question would be: what is the best diet? In this case, H₀ would be that there is no difference between the two diets in mouse metabolism (H₀: μ₁ = μ₂) and the alternative hypothesis would be that the diets have different effects on the animals' metabolism (H₁: μ₁ ≠ μ₂).

The hypothesis is defined by the researcher, according to his/her interests in answering the main question. Besides that, there can be more than one alternative hypothesis. It can assume not only differences across observed parameters, but also the direction of those differences (i.e. higher or lower).
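The two-diet example can be sketched numerically. The code below is a minimal illustration, not the method of any cited source: it uses an exact permutation test (chosen here because it needs only the standard library, in place of the more usual t-test) to obtain a two-sided p-value for H₀: μ₁ = μ₂. The measurements in `diet_1` and `diet_2` are invented for the example.

```python
from itertools import combinations


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    n_splits = 0
    # Under H0 the group labels are exchangeable, so every way of choosing
    # which n_a values belong to the first group is equally likely.
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Hypothetical metabolic measurements for mice on two diets.
diet_1 = [10.0, 11.0, 12.0, 13.0]
diet_2 = [14.0, 15.0, 16.0, 17.0]
p_value = exact_permutation_p_value(diet_1, diet_2)
print(p_value)  # 2 of the 70 possible splits are this extreme
```

With these invented data only the two most lopsided splits are as extreme as the observed one, so the p-value is 2/70, which would lead to rejecting H₀ at α = 0.05.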

Sampling

Usually, a study aims to understand the effect of a phenomenon on a population. In biology, a population is defined as all the individuals of a given species, in a specific area at a given time. In biostatistics, this concept is extended to a variety of collections that can be studied: a population is not only the individuals, but may also be the total of one specific component of their organisms, such as the whole genome or all the sperm cells for animals, or the total leaf area for a plant.

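It is not possible to take measurements from all the elements of a population, so conclusions are drawn from a sample. In clinical research, the trial type (inferiority, equivalence, or superiority) is key in determining sample size.^[4]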

Experimental design

Arrangements of treatments described in the literature include "split plot", "augmented blocks", and many others. All of the designs might include control plots, determined by the researcher, to provide an error estimation during inference.

In

case–control or cohort.^[6]

Data collection

Data collection methods must be considered in research planning, because they strongly influence the sample size and experimental design.

Data collection varies according to the type of data. For quantitative data, collection is done by measuring numerical information using instruments.

In agriculture and biology studies, yield data and its components can be obtained by metric measures. However, pest and disease injuries in plants are obtained by observation, using score scales for levels of damage. In genetic studies especially, modern methods for data collection in the field and laboratory should be considered, such as high-throughput platforms for phenotyping and genotyping. These tools allow bigger experiments, making it possible to evaluate many plots in less time than a human-only method of data collection. Finally, all collected data of interest must be stored in an organized data frame for further analysis.

Analysis and data interpretation

Descriptive tools

Data can be represented through tables or graphical representations, such as line charts, bar charts, histograms and scatter plots. Measures of central tendency and variability can also be very useful to give an overview of the data. Some examples follow:

Frequency tables

One type of table is the frequency table, which consists of data arranged in rows and columns, where the frequency is the number of occurrences or repetitions of data. Frequency can be:^[8]

Absolute: the number of times that a given value appears;

N=f_{1}+f_{2}+f_{3}+...+f_{n}

Relative: the absolute frequency divided by the total number of observations;

n_{i}={\frac {f_{i}}{N}}

In the next example, we have the number of genes in ten operons of the same organism.

Genes = {2,3,3,4,5,3,3,3,3,4}


Genes number	Absolute frequency	Relative frequency
1	0	0
2	1	0.1
3	6	0.6
4	2	0.2
5	1	0.1
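The table above can be reproduced with a few lines of Python. This is only a sketch using the standard library; the variable names are chosen for the example.

```python
from collections import Counter

# Number of genes in ten operons, as in the example above.
genes = [2, 3, 3, 4, 5, 3, 3, 3, 3, 4]

absolute = Counter(genes)   # absolute frequency f_i of each value
total = len(genes)          # N, the sum of all absolute frequencies
relative = {value: count / total for value, count in sorted(absolute.items())}

for value in sorted(absolute):
    print(value, absolute[value], relative[value])
```

Values that never occur, such as 1 in this dataset, are simply absent from the counter, which corresponds to a frequency of 0 in the table.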

Line graph

Line graphs represent the variation of a value over another metric, such as time. In general, values are represented on the vertical axis, while the time variation is represented on the horizontal axis.^[10]

Bar chart

A bar chart is a graph that shows categorical data as bars whose heights (vertical bars) or widths (horizontal bars) are proportional to the values they represent. Bar charts provide an image that could also be represented in a tabular format.^[10]

In the bar chart example, we have the birth rate in Brazil for the December months from 2010 to 2016.^[9] The sharp fall in December 2016 reflects the effect of the Zika virus outbreak on the birth rate in Brazil.

Histograms

The histogram (or frequency distribution) is a graphical representation of a dataset tabulated and divided into uniform or non-uniform classes. It was first introduced by Karl Pearson.^[11]

Scatter plot

A scatter plot is a mathematical diagram that uses Cartesian coordinates to display values of a dataset. A scatter plot shows the data as a set of points, each with the value of one variable determining its position on the horizontal axis and the value of another variable determining its position on the vertical axis.^[12] It is also called a scatter graph, scatter chart, scattergram, or scatter diagram.^[13]

Mean

The arithmetic mean is the sum of a collection of values ( ${x_{1}+x_{2}+x_{3}+\cdots +x_{n}}$ ) divided by the number of items of this collection ( ${n}$ ).

{\bar {x}}={\frac {1}{n}}\left(\sum _{i=1}^{n}{x_{i}}\right)={\frac {x_{1}+x_{2}+\cdots +x_{n}}{n}}

Median

The median is the value in the middle of an ordered dataset.

Mode

The mode is the value of a set of data that appears most often.^[14]

Comparison among mean, median and mode
Values = { 2,3,3,3,3,3,4,4,11 }
Type	Example	Result
Mean	( 2 + 3 + 3 + 3 + 3 + 3 + 4 + 4 + 11 ) / 9	4
Median	2, 3, 3, 3, 3, 3, 4, 4, 11	3
Mode	2, 3, 3, 3, 3, 3, 4, 4, 11	3
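The comparison in the table can be checked with Python's standard `statistics` module; the snippet below is a small sketch using the same values.

```python
import statistics

values = [2, 3, 3, 3, 3, 3, 4, 4, 11]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

The single large value 11 pulls the mean up to 4, while the median and the mode both remain at 3, which is why the median is often preferred for skewed data.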

Box plot

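A box plot is a method for graphically depicting groups of numerical data. The box spans the interquartile range (IQR), from the 25th to the 75th percentile, and so contains the middle 50% of the data. Lines (whiskers) extend from the box to the minimum and maximum values or, when outliers are plotted separately as circles, to the most extreme non-outlying values.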
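The quantities drawn in a box plot can be computed directly. The sketch below reuses the values from the mean, median and mode comparison. Two choices in it are assumptions of this example rather than statements from the text: quartiles are computed with the "inclusive" interpolation method of `statistics.quantiles`, and outliers are flagged with the common 1.5 × IQR rule.

```python
import statistics

values = [2, 3, 3, 3, 3, 3, 4, 4, 11]

# Quartiles by linear interpolation over the sample itself.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # height of the box

# Assumed convention: points further than 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(q1, q2, q3, iqr, outliers)
```

Other quartile conventions give slightly different box edges for small samples, so plotting software may not match these numbers exactly.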

Correlation coefficients

Although correlations between two different kinds of data can be suggested by graphs, such as a scatter plot, it is necessary to validate this through numerical information. For this reason, correlation coefficients are required. They provide a numerical value that reflects the strength of an association.^[10]

Pearson correlation coefficient

The Pearson correlation coefficient is a measure of linear association between two variables, X and Y. This coefficient, usually represented by ρ (rho) for the population and r for the sample, takes values between −1 and 1, where ρ = 1 represents a perfect positive correlation, ρ = −1 represents a perfect negative correlation, and ρ = 0 indicates no linear correlation.^[10]
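As an illustration, the sample coefficient r can be computed as the covariance of the two variables divided by the product of their standard deviations. The function name `pearson_r` and the data below are hypothetical and serve only to show the two extreme cases.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/(n-1) factors of covariance and variances cancel, so they are omitted.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1.0, 2.0, 3.0, 4.0]
print(pearson_r(x, [2.0, 4.0, 6.0, 8.0]))  # perfectly increasing line
print(pearson_r(x, [8.0, 6.0, 4.0, 2.0]))  # perfectly decreasing line
```

The function is undefined when either variable is constant, because the denominator is then zero.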

Inferential statistics

Inferential statistics is used to make inferences^[15] about an unknown population, by estimation and/or hypothesis testing. In other words, it is desirable to obtain parameters that describe the population of interest, but since the data are limited, a representative sample must be used to estimate them. With that, it is possible to test previously defined hypotheses and apply the conclusions to the entire population. The standard error of the mean is a measure of variability that is crucial for making inferences.^[5]

Hypothesis testing

Hypothesis testing is essential for making inferences about populations in order to answer research questions, as set out in the "Research planning" section. Four steps have been defined:^[5]

The hypothesis to be tested: as stated earlier, we have to define a null hypothesis (H₀), which is going to be tested, and an alternative hypothesis. Both must be defined before the experiment is implemented.
Significance level and decision rule: a decision rule depends on the level of significance, or in other words, the acceptable error rate (α). It is easiest to think of α as defining a critical value against which a test statistic is compared to determine statistical significance. So, α also has to be defined before the experiment.
Experiment and statistical analysis: this is when the experiment is actually carried out following the appropriate experimental design, data are collected and the most suitable statistical tests are applied.
Inference: made when the null hypothesis is rejected or not rejected, based on the evidence provided by comparing the p-value with α. Failure to reject H₀ means only that there is not enough evidence to support its rejection, not that the hypothesis is true.

Confidence intervals

A confidence interval is a range of values that is expected to contain the true parameter value at a given level of confidence. The first step is to obtain the best unbiased estimate of the population parameter. The upper limit of the interval is obtained by adding to this estimate the product of the standard error of the mean and the critical value corresponding to the chosen confidence level. The lower limit is calculated similarly, but by subtraction instead of addition.^[5]
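A minimal sketch of this calculation for a sample mean is given below. It assumes a normal approximation, so the multiplier applied to the standard error is the standard normal quantile for the chosen confidence level (about 1.96 for 95%); for small samples a Student's t quantile would normally be used instead. The function name and the data are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    estimate = statistics.mean(sample)
    sem = statistics.stdev(sample) / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * sem, estimate + z * sem


# Hypothetical measurements.
sample = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]
lower, upper = mean_confidence_interval(sample)
print(lower, upper)
```

Raising the confidence level widens the interval, because the critical value grows while the estimate and its standard error stay the same.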

Statistical considerations

Power and statistical error

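When testing a hypothesis, there are two types of statistical error possible:

Type I error (false positive): rejecting a null hypothesis that is true. Its probability is the significance level α.
Type II error (false negative): failing to reject a null hypothesis that is false. Its probability is denoted β, and the statistical power of the test is 1 − β.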
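To make the relation between α, β and power concrete, the sketch below computes the power of a one-sided z-test with known standard deviation. The setting is an assumption made for this example (the text does not specify a test), and the function name and numbers are hypothetical. Here β is the probability of not rejecting H₀ when the true mean exceeds the null value by `effect`.

```python
import math
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-sided z-test for a mean shift of `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)    # rejection threshold under H0
    shift = effect * math.sqrt(n) / sigma     # mean of the z statistic under H1
    beta = std_normal.cdf(z_crit - shift)     # probability of a type II error
    return 1 - beta


print(z_test_power(effect=0.5, sigma=1.0, n=10))
print(z_test_power(effect=0.5, sigma=1.0, n=40))  # larger sample, more power
```

When the effect is zero the function returns α itself, since rejecting H₀ is then a type I error.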

p-value

The p-value is the probability of obtaining results as extreme as or more extreme than those observed, assuming the null hypothesis (H₀) is true. It is also called the calculated probability. The p-value is commonly confused with the significance level (α), but α is a predefined threshold for calling results significant, whereas p is computed from the data. If p is less than α, the null hypothesis (H₀) is rejected.^[16]
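The distinction between p and α can be shown with a short sketch. It assumes, purely for illustration, a test statistic z that follows a standard normal distribution under H₀; the observed value 2.5 is invented.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability, under H0, of a standard normal statistic at least as extreme as z."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


alpha = 0.05                    # fixed before looking at the data
p = two_sided_p_value(2.5)      # computed from the (hypothetical) observed statistic
reject = p < alpha
print(p, reject)
```

Here α is chosen in advance and does not change, while p depends on the observed statistic: a statistic of about 1.96 gives p close to 0.05, and larger absolute values give smaller p.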

Multiple testing

When many hypotheses are tested at once, the probability of obtaining at least one false positive (the familywise error rate) increases, and some strategy is needed to control it. This is commonly achieved by using a more stringent threshold to reject null hypotheses. The Bonferroni correction defines an acceptable global significance level, denoted by α*, and each of the m tests is individually compared with a value of α = α*/m. This ensures that the familywise error rate over all m tests is less than or equal to α*. When m is large, the Bonferroni correction may be overly conservative. An alternative is to control the false discovery rate (FDR), the expected proportion of the rejected null hypotheses (the so-called discoveries) that are false (incorrect rejections). The Benjamini–Hochberg procedure ensures that, for independent tests, the false discovery rate is at most a chosen level q*. Thus, FDR control is less conservative than the Bonferroni correction and has more power, at the cost of more false positives.^[17]
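The two corrections can be compared on a small set of p-values. The sketch below implements the Bonferroni rule and the Benjamini–Hochberg step-up procedure in plain Python; the function names and the ten p-values are hypothetical.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject H0 for every test whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg_reject(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    # Find the largest rank k whose ordered p-value satisfies p_(k) <= (k / m) * q.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            largest_k = rank
    reject = [False] * m
    for i in order[:largest_k]:  # reject the k smallest p-values
        reject[i] = True
    return reject


p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bonferroni_reject(p_values)))          # number of rejections
print(sum(benjamini_hochberg_reject(p_values)))  # number of rejections
```

On these invented values Bonferroni rejects one hypothesis and Benjamini–Hochberg rejects two, which illustrates the gain in power described above.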

Mis-specification and robustness checks

The main hypothesis being tested (e.g., no association between treatments and outcomes) is often accompanied by other technical assumptions (e.g., about the form of the probability distribution of the outcomes) that are also part of the null hypothesis. When the technical assumptions are violated in practice, then the null may be frequently rejected even if the main hypothesis is true. Such rejections are said to be due to model mis-specification.^[18] Verifying whether the outcome of a statistical test does not change when the technical assumptions are slightly altered (so-called robustness checks) is the main way of combating mis-specification.

Model selection criteria

Model selection criteria are used to choose, among candidate models, the one that best approximates the true model. Akaike's Information Criterion (AIC) and the Bayesian Information Criterion (BIC) are two widely used examples; AIC is asymptotically efficient, whereas BIC is consistent.
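Both criteria penalize the maximized log-likelihood by the number of estimated parameters k, using the standard definitions AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where n is the number of observations. The sketch below applies them to two hypothetical fitted models; the log-likelihood values are invented for the example.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln n - 2 ln L (lower is better)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fits to the same n = 50 observations.
n = 50
simple = {"log_likelihood": -100.0, "k": 3}
complex_ = {"log_likelihood": -98.5, "k": 6}

print(aic(**simple), aic(**complex_))
print(bic(n=n, **simple), bic(n=n, **complex_))
```

In this invented case the small gain in likelihood does not justify three extra parameters, so both criteria prefer the simpler model; BIC penalizes the extra parameters more heavily once n exceeds about 7, since ln n is then greater than 2.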

Developments and big data

Recent developments have made a large impact on biostatistics. Two important changes have been the ability to collect data on a high-throughput scale, and the ability to perform much more complex analysis using computational techniques. These come from developments in areas such as sequencing technologies, bioinformatics and machine learning.

Use in high-throughput data

New biomedical technologies like mass spectrometry (for proteomics) generate enormous amounts of data, allowing many tests to be performed simultaneously.^[19] Careful analysis with biostatistical methods is required to separate the signal from the noise. For example, a microarray could be used to measure many thousands of genes simultaneously, determining which of them have different expression in diseased cells compared to normal cells. However, only a fraction of genes will be differentially expressed.^[20]

Multicollinearity often occurs in high-throughput biostatistical settings. Due to high intercorrelation between the predictors (such as gene expression levels), the information of one predictor might be contained in another one. It could be that only 5% of the predictors are responsible for 90% of the variability of the response. In such a case, one could apply the biostatistical technique of dimension reduction (for example via principal component analysis). Classical statistical techniques like linear or logistic regression and linear discriminant analysis do not work well for high dimensional data (i.e. when the number of observations n is smaller than the number of features or predictors p: n < p). As a matter of fact, one can get quite high R²-values despite very low predictive power of the statistical model. These classical statistical techniques (esp. least squares linear regression) were developed for low dimensional data (i.e. where the number of observations n is much larger than the number of predictors p: n >> p). In cases of high dimensionality, one should always consider an independent validation test set and the corresponding residual sum of squares (RSS) and R² of the validation test set, not those of the training set.
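The warning about training-set R² can be demonstrated with simulated data. In the sketch below, which is an illustrative construction and not taken from the cited literature, the response is pure noise and unrelated to the predictors, and there are as many predictors as observations (n = p, the boundary of the high-dimensional regime). Least squares without an intercept then reduces to solving a square linear system, done here with a small hand-written Gaussian elimination so that only the standard library is needed.

```python
import random


def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def predict(rows, coef):
    return [sum(v * c for v, c in zip(row, coef)) for row in rows]


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    rss = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # residual sum of squares
    tss = sum((a - mean_y) ** 2 for a in y)            # total sum of squares
    return 1 - rss / tss


rng = random.Random(0)
n = p = 8  # as many predictors as observations
x_train = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y_train = [rng.gauss(0, 1) for _ in range(n)]  # response unrelated to predictors
x_valid = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y_valid = [rng.gauss(0, 1) for _ in range(n)]

coef = solve(x_train, y_train)
r2_train = r_squared(y_train, predict(x_train, coef))
r2_valid = r_squared(y_valid, predict(x_valid, coef))
print(r2_train, r2_valid)
```

The model reproduces the training responses essentially exactly, giving a training R² of 1 even though there is no real relationship, while the R² on the independent validation set is far lower. This is the reason for judging such models on held-out data only.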

Often, it is useful to pool information from multiple predictors together. For example, Gene Set Enrichment Analysis (GSEA) considers the perturbation of whole (functionally related) gene sets rather than of single genes.^[21] These gene sets might be known biochemical pathways or otherwise functionally related genes. The advantage of this approach is that it is more robust: It is more likely that a single gene is found to be falsely perturbed than it is that a whole pathway is falsely perturbed. Furthermore, one can integrate the accumulated knowledge about biochemical pathways (like the JAK-STAT signaling pathway) using this approach.

Bioinformatics advances in databases, data mining, and biological interpretation

The development of

Gene Ontology).^[22] In addition to databases that contain specific molecular information, there are broader ones that store information about an organism or group of organisms. An example of a database directed towards just one organism, but containing much data about it, is the Arabidopsis thaliana genetic and molecular database, TAIR.^[23] Phytozome,^[24] in turn, stores the assemblies and annotation files of dozens of plant genomes, and also contains visualization and analysis tools. Moreover, some databases are interconnected for information exchange and sharing; a major initiative was the International Nucleotide Sequence Database Collaboration (INSDC),^[25] which relates data from DDBJ,^[26] EMBL-EBI,^[27] and NCBI.^[28]

Nowadays, the increase in size and complexity of molecular datasets leads to the use of powerful statistical methods provided by computer science algorithms developed in the field of machine learning. Neural network implementations and support vector machine models are examples of common machine learning algorithms.

Collaborative work among molecular biologists, bioinformaticians, statisticians and computer scientists is important to perform an experiment correctly, going from planning, passing through data generation and analysis, and ending with biological interpretation of the results.^[22]

Use of computationally intensive methods

On the other hand, the advent of modern computer technology and relatively cheap computing resources have enabled computer-intensive biostatistical methods like bootstrapping and re-sampling methods.
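As an example of such a computer-intensive method, the sketch below computes a percentile bootstrap confidence interval for a mean: the sample is resampled with replacement many times, and the spread of the resampled means is used in place of a distributional formula. The function name, the number of resamples, the seed and the data are all choices made for this illustration.

```python
import random
import statistics


def bootstrap_mean_ci(sample, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n = len(sample)
    # Each resample draws n values from the sample with replacement.
    means = sorted(
        statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


sample = [2, 3, 3, 3, 3, 3, 4, 4, 11]
lower, upper = bootstrap_mean_ci(sample)
print(lower, upper)
```

Because the sample contains one large value, the resulting interval is not symmetric around the sample mean, something a normal-theory interval would not capture.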

In recent times, random forests have gained popularity as a method for performing statistical classification. Random forest techniques generate a panel of decision trees. Decision trees have the advantage that you can draw them and interpret them (even with a basic understanding of mathematics and statistics). Random forests have thus been used for clinical decision support systems.^[citation needed]

Applications

Public health

Biostatistics is applied in public health, including epidemiology, health services research, nutrition, environmental health and health care policy and management. In these medical contexts, it is important to consider the design and analysis of clinical trials. One example is the assessment of the severity state of a patient with a prognosis of an outcome of a disease.

With new technologies and genetics knowledge, biostatistics is now also used for systems medicine, which amounts to a more personalized medicine. For this, data from different sources are integrated, including conventional patient data, clinico-pathological parameters, molecular and genetic data as well as data generated by additional new-omics technologies.^[29]

Quantitative genetics

The study of

gene map based on linkage has to be built. Some of the best-known quantitative trait locus (QTL) mapping algorithms are Interval Mapping, Composite Interval Mapping, and Multiple Interval Mapping.^[30]

However, QTL mapping resolution is impaired by the amount of recombination assayed, a problem for species in which it is difficult to obtain large progenies. Furthermore, allele diversity is restricted to individuals originating from contrasting parents, which limits studies of allele diversity when we have a panel of individuals representing a natural population.^[31] For this reason, the genome-wide association study (GWAS) was proposed in order to identify QTLs based on linkage disequilibrium, that is, the non-random association between traits and molecular markers. It was leveraged by the development of high-throughput SNP genotyping.^[32]

In animal and plant breeding, the use of markers, mainly molecular ones, in selection contributed to the development of marker-assisted selection. While QTL mapping is limited by resolution, GWAS does not have enough power to detect rare variants of small effect that are also influenced by the environment. So, the concept of Genomic Selection (GS) arose in order to use all molecular markers in the selection and allow the prediction of the performance of candidates in this selection. The proposal is to genotype and phenotype a training population and to develop a model that can obtain the genomic estimated breeding values (GEBVs) of individuals belonging to a genotyped but not phenotyped population, called the testing population.^[33] This kind of study could also include a validation population, following the concept of cross-validation, in which the real phenotype results measured in this population are compared with the phenotype results based on the prediction, which is used to check the accuracy of the model.

In summary, some applications of quantitative genetics are:

It has been used in agriculture to improve crops (plant breeding) and livestock (animal breeding).
In biomedical research, this work can assist in finding candidate gene alleles that can cause or influence predisposition to diseases in human genetics.

Expression data

Studies of differential gene expression, for example from microarrays, demand a comparison of conditions. The goal is to identify genes which have a significant change in abundance between different conditions. Experiments are therefore designed appropriately, with replicates for each condition/treatment, and with randomization and blocking when necessary. In RNA-Seq, the quantification of expression uses the information of mapped reads that are summarized in some genetic unit, such as the exons that are part of a gene sequence. While microarray results can be approximated by a normal distribution, RNA-Seq count data are better explained by other distributions. The first distribution used was the Poisson, but it underestimates the sample error, leading to false positives. Currently, biological variation is accounted for by methods that estimate a dispersion parameter of a negative binomial distribution. Generalized linear models are used to perform the tests for statistical significance, and as the number of genes is high, multiple testing correction has to be considered.^[34] Other examples of analysis of genomics data come from microarray or proteomics experiments,^[35]^[36] often concerning diseases or disease stages.^[37]
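The reason the Poisson model understates the sampling error can be seen from the variances of the two distributions. Under a Poisson model the variance of a count equals its mean μ. A commonly used parameterization of the negative binomial (an assumption of this sketch, since the text does not give one) writes the variance as μ + φμ², where φ is the dispersion parameter. The numbers below are invented.

```python
def poisson_variance(mean):
    """Variance of a Poisson count: equal to its mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Variance of a negative binomial count: mean + dispersion * mean**2."""
    return mean + dispersion * mean ** 2


# Hypothetical gene with a mean of 100 reads and a dispersion of 0.1.
mean_count = 100.0
print(poisson_variance(mean_count))
print(negative_binomial_variance(mean_count, dispersion=0.1))
```

For this invented gene the negative binomial variance is eleven times the Poisson variance, so a test that assumes Poisson noise would treat ordinary biological variation between replicates as evidence of differential expression. With a dispersion of zero the two models coincide.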

Other studies

Ecology, ecological forecasting
Biological sequence analysis^[38]
Systems biology for gene network inference or pathways analysis.^[39]
Clinical research and pharmaceutical development
Population dynamics, especially in regards to fisheries science.
Phylogenetics and evolution
Pharmacodynamics
Pharmacokinetics
Neuroimaging

Tools

Many tools can be used for the statistical analysis of biological data. Most of them are also useful in other areas of knowledge and cover a large number of applications. Brief descriptions of some of them follow:

ASReml: Software developed by VSNi^[40] that can also be used in the R environment as a package. It is designed to estimate variance components under a general linear mixed model using restricted maximum likelihood (REML). Models with fixed effects and random effects, nested or crossed, are allowed. It makes it possible to investigate different variance-covariance matrix structures.
CycDesigN:^[41] A computer package developed by VSNi^[40] that helps researchers create experimental designs and analyze data coming from a design in one of the classes handled by CycDesigN. These classes are resolvable, non-resolvable, partially replicated and crossover designs. It also includes less commonly used designs, the Latinized ones, such as the t-Latinized design.^[42]
Orange: A programming interface for high-level data processing, data mining and data visualization. It includes tools for gene expression and genomics.^[22]
R: An open source environment and programming language dedicated to statistical computing and graphics. It is an implementation of the S language, with packages distributed through CRAN.^[43] In addition to its functions to read data tables, compute descriptive statistics, and develop and evaluate models, its repository contains packages developed by researchers around the world. This allows the development of functions written to deal with the statistical analysis of data that comes from specific applications.^[44] In the case of bioinformatics, for example, there are packages located in the main repository (CRAN) and in others, such as Bioconductor. It is also possible to use packages under development that are shared on hosting services such as GitHub.
SAS: A widely used data analysis software, found in universities, services and industry. Developed by a company with the same name (SAS Institute), it uses the SAS language for programming.
PLA 3.0:^[45] A biostatistical analysis software for regulated environments (e.g. drug testing) which supports Quantitative Response Assays (Parallel-Line, Parallel-Logistics, Slope-Ratio) and Dichotomous Assays (Quantal Response, Binary Assays). It also supports weighting methods for combination calculations and the automatic data aggregation of independent assay data.
Weka: A Java software for machine learning and data mining, including tools and methods for visualization, clustering, regression, association rules, and classification. There are tools for cross-validation and bootstrapping, and a module for algorithm comparison. Weka can also be called from other programming languages such as Perl or R.^[22]

Python: image analysis, deep learning, machine learning
SQL databases
NoSQL
NumPy: numerical Python
SciPy
SageMath
LAPACK: linear algebra
MATLAB
Apache Hadoop
Apache Spark
Amazon Web Services

Scope and training programs

Almost all educational programmes in biostatistics are at postgraduate level. They are most often found in schools of public health, affiliated with schools of medicine, forestry, or agriculture, or as a focus of application in departments of statistics.

In the United States, where several universities have dedicated biostatistics departments, many other top-tier universities integrate biostatistics faculty into statistics or other departments, such as epidemiology. Thus, departments carrying the name "biostatistics" may exist under quite different structures. For instance, relatively new biostatistics departments have been founded with a focus on bioinformatics and computational biology, whereas older departments, typically affiliated with schools of public health, will have more traditional lines of research involving epidemiological studies and clinical trials as well as bioinformatics. In larger universities around the world, where both a statistics and a biostatistics department exist, the degree of integration between the two departments may range from the bare minimum to very close collaboration. In general, the difference between a statistics program and a biostatistics program is twofold: (i) statistics departments will often host theoretical/methodological research, which is less common in biostatistics programs, and (ii) statistics departments have lines of research that may include biomedical applications but also other areas such as industry (quality control), business and economics, and biological areas other than medicine.

Specialized journals

Biostatistics^[46]
International Journal of Biostatistics^[47]
Journal of Epidemiology and Biostatistics^[48]
Biostatistics and Public Health^[49]
Biometrics^[50]
Biometrika^[51]
Biometrical Journal^[52]
Communications in Biometry and Crop Science^[53]
Statistical Applications in Genetics and Molecular Biology^[54]
Statistical Methods in Medical Research^[55]
Pharmaceutical Statistics^[56]
Statistics in Medicine^[57]

References

[1] Centre for Transformative Innovation, Swinburne University of Technology. "Allan, Frances Elizabeth (Betty) - Person - Encyclopedia of Australian Science and Innovation". www.eoas.info. Retrieved 2022-10-26.
[2] PMID 19079046.
[3] Charles T. Munger (2003-10-03). "Academic Economics: Strengths and Faults After Considering Interdisciplinary Needs" (PDF). Archived (PDF) from the original on 2022-10-09.
[4] PMID 28778775.
[5] PMID 18042950.
[6] S2CID 30875225.
[7] S2CID 10733556.
[8] Maths, Sangaku. "Absolute, relative, cumulative frequency and statistical tables – Probability and Statistics". www.sangakoo.com. Retrieved 2018-04-10.
[9] "DATASUS: TabNet Win32 3.0: Nascidos vivos – Brasil". DATASUS: Tecnologia da Informação a Serviço do SUS.
[10] ISBN 978-0-12-262270-0.
[11] ISSN 0264-3820.
[12] OCLC 56568530.
[13] OCLC 30301196.
[14] Gujarati, Damodar N. (2006). Econometrics. McGraw-Hill Irwin.
[15] ISSN 1326-0200.
[16] PMID 26961635.
[17] Benjamini, Y. & Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. Journal of the Royal Statistical Society. Series B (Methodological) 57, 289–300 (1995).
[18] "Null hypothesis". www.statlect.com. Retrieved 2018-05-08.
[19] PMID 22329008.
[20] S2CID 8417479.
[21] PMID 16199517.
[22] S2CID 221831488.
[23] "TAIR - Home Page". www.arabidopsis.org.
[24] "Phytozome". phytozome.jgi.doe.gov.
[25] "International Nucleotide Sequence Database Collaboration - INSDC". www.insdc.org.
[26] "Top". www.ddbj.nig.ac.jp. 11 January 2024.
[27] "The European Bioinformatics Institute < EMBL-EBI". www.ebi.ac.uk.
[28] "National Center for Biotechnology Information". www.ncbi.nlm.nih.gov. U.S. National Library of Medicine.
[29] PMID 29497170.
[30] S2CID 1094152.
[31] PMID 23876160.
[32] doi:10.3835/plantgenome2008.02.0089.
[33] PMID 28965742. Archived (PDF) from the original on 2022-10-09.
[34] PMID 21176179.
[35] Helen Causton; John Quackenbush; Alvis Brazma (2003). Statistical Analysis of Gene Expression Microarray Data. Wiley-Blackwell.
[36] Terry Speed (2003). Microarray Gene Expression Data Analysis: A Beginner's Guide. Chapman & Hall/CRC.
[37] ISBN 978-3-527-32585-6.
[38] Warren J. Ewens; Gregory R. Grant (2004). Statistical Methods in Bioinformatics: An Introduction. Springer.
[39] ISBN 978-3-527-32750-8.
[40] "Home - VSN International". www.vsni.co.uk.
[41] "CycDesigN - VSN International". www.vsni.co.uk.
[42] doi:10.2134/agronj15.0144.
[43] "The Comprehensive R Archive Network". cran.r-project.org.
[44] ISBN 9789354936586.
[45] Stegmann, Dr Ralf (2019-07-01). "PLA 3.0". PLA 3.0 – Software for Biostatistical Analysis. Retrieved 2019-07-02.
[46] "Biostatistics - Oxford Academic". OUP Academic.
[47] "The International Journal of Biostatistics".
[48] "PubMed Journals will be shut down". 15 June 2018.
[49] https://ebph.it/ Epidemiology
[50] doi:10.1111/(ISSN)1541-0420.
[51] "Biometrika - Oxford Academic". OUP Academic.
[52] doi:10.1002/(ISSN)1521-4036.
[53] "Communications in Biometry and Crop Science". agrobiol.sggw.waw.pl.
[54] "Statistical Applications in Genetics and Molecular Biology". www.degruyter.com. 1 May 2002.
[55] "Statistical Methods in Medical Research". SAGE Journals.
[56] "Pharmaceutical Statistics". onlinelibrary.wiley.com.
[57] doi:10.1002/(ISSN)1097-0258.

External links

The International Biometric Society

The Collection of Biostatistics Research Archive

Guide to Biostatistics (MedPageToday.com) Archived 2012-05-22 at the Wayback Machine

Biomedical Statistics

