Statistics
Statistics (from
When
Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation).[4] Descriptive statistics are most often concerned with two sets of properties of a distribution (sample or population): central tendency (or location) seeks to characterize the distribution's central or typical value, while dispersion (or variability) characterizes the extent to which members of the distribution depart from its center and each other. Inferences made using mathematical statistics employ the framework of probability theory, which deals with the analysis of random phenomena.
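As a minimal sketch of the two families of descriptive measures named above, the following Python snippet computes measures of central tendency and of dispersion using the standard library. The sample values are hypothetical and chosen only for illustration.

```python
import statistics

# Hypothetical sample of eight measurements (illustrative values only).
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Central tendency: where the "typical" value lies.
mean = statistics.mean(sample)
median = statistics.median(sample)

# Dispersion: how far the values spread around the center.
sample_sd = statistics.stdev(sample)       # divides by n - 1
population_sd = statistics.pstdev(sample)  # divides by n
value_range = max(sample) - min(sample)

print(f"mean={mean}, median={median}")
print(f"sample sd={sample_sd:.3f}, population sd={population_sd:.3f}, range={value_range}")
```

The choice between `stdev` and `pstdev` depends on whether the data are treated as a sample from a larger population or as the whole population.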
A standard statistical procedure involves the collection of data leading to a
Statistical measurement processes are also prone to error with regard to the data that they generate. Many of these errors are classified as random (noise) or systematic (bias), but other types of errors (e.g., blunders, such as when an analyst reports incorrect units) can also occur. The presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems.
Introduction
Statistics is described either as a mathematical body of science that pertains to the collection, analysis, interpretation or explanation, and presentation of data,[5] or as a branch of mathematics.[6] Some consider statistics to be a distinct mathematical science rather than a branch of mathematics. While many scientific investigations make use of data, statistics is generally concerned with the use of data in the context of uncertainty and decision-making in the face of uncertainty.[7][8]
In applying statistics to a problem, it is common practice to start with a
When a census is not feasible, a chosen subset of the population called a
Mathematical statistics
Mathematical statistics is the application of mathematics to statistics. Mathematical techniques used for this include
History
Formal discussions on inference date back to the
Although the term statistic was introduced by the Italian scholar
The mathematical foundations of statistics developed from discussions concerning
The modern field of statistics emerged in the late 19th and early 20th century in three stages.
The second wave of the 1910s and 20s was initiated by
The final wave, which mainly saw the refinement and expansion of earlier developments, emerged from the collaborative work between
Today, statistical methods are applied in all fields that involve decision making, for making accurate inferences from a collated body of data and for making decisions in the face of uncertainty based on statistical methodology. The use of modern computers has expedited large-scale statistical computations and has also made possible new methods that are impractical to perform manually. Statistics continues to be an area of active research, for example on the problem of how to analyze big data.[37]
Statistical data
Data collection
Sampling
When full census data cannot be collected, statisticians collect sample data by developing specific experiment designs and survey samples. Statistics itself also provides tools for prediction and forecasting through statistical models.
To use a sample as a guide to an entire population, it is important that it truly represents the overall population. Representative sampling assures that inferences and conclusions can safely extend from the sample to the population as a whole. A major problem lies in determining the extent that the sample chosen is actually representative. Statistics offers methods to estimate and correct for any bias within the sample and data collection procedures. There are also methods of experimental design that can lessen these issues at the outset of a study, strengthening its capability to discern truths about the population.
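The effect of an unrepresentative sample can be illustrated with a small simulation. The sketch below assumes a hypothetical, normally distributed population and compares a simple random sample with a deliberately biased "convenience" sample; all names and parameters are illustrative assumptions, not part of any particular survey method.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 values, stored from smallest to largest.
population = sorted(rng.gauss(50, 10) for _ in range(10_000))
true_mean = statistics.mean(population)

# Simple random sample: every member has the same chance of selection.
random_sample = rng.sample(population, 200)

# Convenience sample: only the first 200 members (the smallest values),
# a deliberately unrepresentative choice.
convenience_sample = population[:200]

print(f"population mean:         {true_mean:.2f}")
print(f"random sample mean:      {statistics.mean(random_sample):.2f}")
print(f"convenience sample mean: {statistics.mean(convenience_sample):.2f}")
```

The random sample's mean should land close to the population mean, while the convenience sample's mean is far too low, however many members it contains.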
Sampling theory is part of the
Experimental and observational studies
A common goal for a statistical research project is to investigate
Experiments
The basic steps of a statistical experiment are:
- Planning the research, including finding the number of replicates of the study, using preliminary estimates of the size of experimental variability. Consideration of the selection of experimental subjects and the ethics of research is necessary. Statisticians recommend that experiments compare (at least) one new treatment with a standard treatment or control, to allow an unbiased estimate of the difference in treatment effects.
- Writing the experimental protocol that will guide the performance of the experiment and that specifies the primary analysis of the experimental data.
- Performing the experiment following the experimental protocol and analyzing the data as the protocol specifies.
- Further examining the data set in secondary analyses, to suggest new hypotheses for future study.
- Documenting and presenting the results of the study.
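The comparison of a new treatment with a control can be sketched as a simulation. The example below assumes randomized assignment of subjects to groups (a standard design choice, though not spelled out in the steps above) and a simulated outcome with an assumed true effect; the subject identifiers, effect size, and noise level are all hypothetical.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical subjects, identified by number.
subjects = list(range(40))
rng.shuffle(subjects)  # random assignment to groups
control, treatment = subjects[:20], subjects[20:]

TRUE_EFFECT = 2.0  # assumed effect, used only to simulate outcomes


def simulated_outcome(treated: bool) -> float:
    """Simulated response: baseline noise plus the assumed treatment effect."""
    return rng.gauss(10.0, 1.0) + (TRUE_EFFECT if treated else 0.0)


control_outcomes = [simulated_outcome(False) for _ in control]
treatment_outcomes = [simulated_outcome(True) for _ in treatment]

# Primary analysis fixed in advance: the difference in group means.
estimated_effect = (statistics.mean(treatment_outcomes)
                    - statistics.mean(control_outcomes))
print(f"estimated treatment effect: {estimated_effect:.2f}")
```

In a real experiment the true effect is unknown; the simulation fixes it only so that the estimate can be compared with a known value.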
Experiments on human behavior have special concerns. The famous
Observational study
An example of an observational study is one that explores the association between smoking and lung cancer. This type of study typically uses a survey to collect observations about the area of interest and then performs statistical analysis. In this case, the researchers would collect observations of both smokers and non-smokers, perhaps through a
Types of data
Various attempts have been made to produce a taxonomy of levels of measurement. The psychophysicist Stanley Smith Stevens defined nominal, ordinal, interval, and ratio scales. Nominal measurements do not have meaningful rank order among values, and permit any one-to-one (injective) transformation. Ordinal measurements have imprecise differences between consecutive values, but have a meaningful order to those values, and permit any order-preserving transformation. Interval measurements have meaningful distances between measurements defined, but the zero value is arbitrary (as in the case with longitude and temperature measurements in Celsius or Fahrenheit), and permit any linear transformation. Ratio measurements have both a meaningful zero value and the distances between different measurements defined, and permit any rescaling transformation.
Because variables conforming only to nominal or ordinal measurements cannot be reasonably measured numerically, sometimes they are grouped together as categorical variables, whereas ratio and interval measurements are grouped together as quantitative variables, which can be either discrete or continuous, due to their numerical nature. Such distinctions can often be loosely correlated with data type in computer science, in that dichotomous categorical variables may be represented with the Boolean data type, polytomous categorical variables with arbitrarily assigned integers in the integral data type, and continuous variables with the real data type involving floating-point arithmetic. But the mapping of computer science data types to statistical data types depends on which categorization of the latter is being implemented.
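The loose correspondence between statistical and computer-science data types can be sketched in Python. The record type and field names below are hypothetical and serve only to illustrate one possible mapping.

```python
from dataclasses import dataclass
from enum import Enum


class BloodType(Enum):
    """Polytomous categorical (nominal) variable: the integer codes are arbitrary."""
    A = 1
    B = 2
    AB = 3
    O = 4


@dataclass
class Record:
    smoker: bool           # dichotomous categorical -> Boolean
    blood_type: BloodType  # polytomous categorical -> arbitrarily assigned integers
    visits: int            # discrete quantitative (a count)
    weight_kg: float       # continuous quantitative -> floating point


record = Record(smoker=False, blood_type=BloodType.AB, visits=3, weight_kg=71.5)
print(record)
```

Wrapping the integer codes in an enumeration discourages arithmetic on them, which would be meaningless for a nominal variable.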
Other categorizations have been proposed. For example, Mosteller and Tukey (1977)[41] distinguished grades, ranks, counted fractions, counts, amounts, and balances. Nelder (1990)[42] described continuous counts, continuous ratios, count ratios, and categorical modes of data. (See also: Chrisman (1998),[43] van den Berg (1991).[44])
The issue of whether or not it is appropriate to apply different kinds of statistical methods to data obtained from different kinds of measurement procedures is complicated by issues concerning the transformation of variables and the precise interpretation of research questions. "The relationship between the data and what they describe merely reflects the fact that certain kinds of statistical statements may have truth values which are not invariant under some transformations. Whether or not a transformation is sensible to contemplate depends on the question one is trying to answer."[45]: 82
Methods
Descriptive statistics
A descriptive statistic (in the
Inferential statistics
Statistical inference is the process of using data analysis to deduce properties of an underlying probability distribution.[48] Inferential statistical analysis infers properties of a population, for example by testing hypotheses and deriving estimates. It is assumed that the observed data set is sampled from a larger population. Inferential statistics can be contrasted with descriptive statistics. Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the assumption that the data come from a larger population.[49]
Terminology and theory of inferential statistics
Statistics, estimators and pivotal quantities
Consider
A statistic is a random variable that is a function of the random sample, but not a function of unknown parameters. The probability distribution of the statistic, though, may have unknown parameters. Consider now a function of the unknown parameter: an
A random variable that is a function of the random sample and of the unknown parameter, but whose probability distribution does not depend on the unknown parameter is called a
Between two estimators of a given parameter, the one with lower
Other desirable properties for estimators include:
This still leaves the question of how to obtain estimators in a given situation and carry out the computation. Several methods have been proposed: the
Null hypothesis and alternative hypothesis
Interpretation of statistical information can often involve the development of a null hypothesis which is usually (but not necessarily) that no relationship exists among variables or that no change occurred over time.[51][52]
The best illustration for a novice is the predicament encountered in a criminal trial. The null hypothesis, H0, asserts that the defendant is innocent, whereas the alternative hypothesis, H1, asserts that the defendant is guilty. The indictment comes because of suspicion of the guilt. The H0 (status quo) stands in opposition to H1 and is maintained unless H1 is supported by evidence "beyond a reasonable doubt". However, "failure to reject H0" in this case does not imply innocence, but merely that the evidence was insufficient to convict. So the jury does not necessarily accept H0 but fails to reject H0. While one cannot "prove" a null hypothesis, one can test how close it is to being true with a
What
Error
Working from a null hypothesis, two broad categories of error are recognized:
- Type I errors where the null hypothesis is falsely rejected, giving a "false positive".
- Type II errors where the null hypothesis fails to be rejected and an actual difference between populations is missed, giving a "false negative".
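Both error rates can be estimated by simulation. The sketch below assumes a two-sided z-test with known standard deviation, a 5% significance level, and a hypothetical true mean of 0.5 for the case where the null hypothesis is false; these settings are illustrative assumptions only.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(42)  # fixed seed so the sketch is repeatable
ALPHA = 0.05
N = 25         # observations per simulated study
TRIALS = 2000  # number of simulated studies
Z_CRIT = NormalDist().inv_cdf(1 - ALPHA / 2)  # two-sided critical value


def rejects_null(true_mean: float) -> bool:
    """Two-sided z-test of H0: mean = 0, with known standard deviation 1."""
    sample = [rng.gauss(true_mean, 1.0) for _ in range(N)]
    z = mean(sample) * N ** 0.5
    return abs(z) > Z_CRIT


# Type I error: H0 is true (mean 0) but the test rejects it.
type_i_rate = sum(rejects_null(0.0) for _ in range(TRIALS)) / TRIALS

# Type II error: H0 is false (mean 0.5) but the test fails to reject it.
type_ii_rate = sum(not rejects_null(0.5) for _ in range(TRIALS)) / TRIALS

print(f"estimated Type I rate:  {type_i_rate:.3f}")
print(f"estimated Type II rate: {type_ii_rate:.3f}")
```

The estimated Type I rate should be close to the chosen significance level, whereas the Type II rate depends on the sample size and on how far the true mean lies from the null value.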
A
Many statistical methods seek to minimize the
Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as
Interval estimation
Most studies only sample part of a population, so results do not fully represent the whole population. Any estimates obtained from the sample only approximate the population value.
In principle, confidence intervals can be symmetrical or asymmetrical. An interval can be asymmetrical because it works as a lower or upper bound for a parameter (left-sided or right-sided interval), but it can also be asymmetrical because the two-sided interval is built violating symmetry around the estimate. Sometimes the bounds for a confidence interval are reached asymptotically and these are used to approximate the true bounds.
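A symmetric two-sided interval for a mean can be sketched with the normal approximation. The sample values and the function name below are hypothetical; for a sample this small, an interval based on Student's t-distribution would usually be preferred, so the normal quantile is used here only to keep the example within the standard library.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample (illustrative values only).
sample = [4.1, 5.3, 4.8, 5.9, 5.0, 4.6, 5.4, 5.2, 4.9, 5.1,
          4.7, 5.5, 5.0, 4.4, 5.6, 5.2, 4.8, 5.3, 4.9, 5.1]


def normal_confidence_interval(data, level=0.95):
    """Symmetric two-sided interval for the mean, using the normal approximation."""
    n = len(data)
    center = mean(data)
    standard_error = stdev(data) / n ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return center - z * standard_error, center + z * standard_error


low, high = normal_confidence_interval(sample)
print(f"95% interval for the mean: ({low:.3f}, {high:.3f})")
```

Raising the confidence level widens the interval, and increasing the sample size narrows it.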
Significance
Statistics rarely give a simple yes/no answer to the question under analysis. Interpretation often comes down to the level of statistical significance applied to the numbers, and often refers to the p-value: the probability of observing a result at least as extreme as the one obtained, assuming that the null hypothesis is true.
The standard approach
Referring to statistical significance does not necessarily mean that the overall result is significant in real world terms. For example, in a large study of a drug it may be shown that the drug has a statistically significant but very small beneficial effect, such that the drug is unlikely to help the patient noticeably.
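The gap between statistical and practical significance can be shown numerically. The sketch below assumes a two-sided z-test and a hypothetical observed effect of 0.02 standard deviations; the function name and numbers are illustrative only.

```python
from statistics import NormalDist


def two_sided_p_value(effect: float, sd: float, n: int) -> float:
    """p-value of a z-test of H0: effect = 0, for an observed mean `effect`."""
    z = effect / (sd / n ** 0.5)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same tiny observed effect in studies of different sizes.
small_study = two_sided_p_value(effect=0.02, sd=1.0, n=100)
large_study = two_sided_p_value(effect=0.02, sd=1.0, n=100_000)

print(f"n = 100:     p = {small_study:.4f}")
print(f"n = 100,000: p = {large_study:.2e}")
```

The effect is identical in both cases; only the sample size changes, yet the larger study yields a far smaller p-value.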
Although in principle the acceptable level of statistical significance may be subject to debate, the
Some problems are usually associated with this framework (See
- A difference that is highly statistically significant can still be of no practical significance, but it is possible to properly formulate tests to account for this. One response involves going beyond reporting only the significance level to include the p-value when reporting whether a hypothesis is rejected or accepted. The p-value, however, does not indicate the size or importance of the observed effect and can also seem to exaggerate the importance of minor differences in large studies. A better and increasingly common approach is to report confidence intervals. Although these are produced from the same calculations as those of hypothesis tests or p-values, they describe both the size of the effect and the uncertainty surrounding it.
- Fallacy of the transposed conditional, also known as the prosecutor's fallacy: criticisms arise because the hypothesis testing approach forces one hypothesis (the null hypothesis) to be favored, since what is being evaluated is the probability of the observed result given the null hypothesis and not the probability of the null hypothesis given the observed result. An alternative to this approach is offered by Bayesian inference, although it requires establishing a prior probability.[54]
- Rejecting the null hypothesis does not automatically prove the alternative hypothesis.
- As with everything involving fat-tailed distributions, p-values may be seriously miscomputed.
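The transposed-conditional problem listed above can be made concrete with Bayes' theorem. The sketch below assumes only two competing hypotheses and uses hypothetical probabilities chosen for illustration.

```python
def posterior_null(prior_null: float, p_data_given_null: float,
                   p_data_given_alt: float) -> float:
    """P(H0 | data) by Bayes' theorem for two exhaustive hypotheses."""
    evidence = (p_data_given_null * prior_null
                + p_data_given_alt * (1 - prior_null))
    return p_data_given_null * prior_null / evidence


# Assumed inputs: the data are unlikely under H0 (0.05), but H0 is a priori
# very plausible (0.95) and the data are only moderately likely under H1 (0.5).
posterior = posterior_null(prior_null=0.95, p_data_given_null=0.05,
                           p_data_given_alt=0.5)
print(f"P(data | H0) = 0.05, but P(H0 | data) = {posterior:.3f}")
```

With these assumed inputs the null hypothesis remains more likely than not after seeing the data, even though the data themselves have probability 0.05 under it. The result depends strongly on the prior, which is the cost of the Bayesian approach noted above.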
Examples
Some well-known statistical
- Analysis of variance (ANOVA)
- Chi-squared test
- Correlation
- Factor analysis
- Mann–Whitney U
- Mean square weighted deviation (MSWD)
- Pearson product-moment correlation coefficient
- Regression analysis
- Spearman's rank correlation coefficient
- Student's t-test
- Time series analysis
- Conjoint analysis
Exploratory data analysis
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
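A first exploratory pass over a small data set might look like the following sketch, which prints a five-number summary and a crude text histogram and flags unusual points. The data are hypothetical, and the 1.5 × IQR rule used for flagging is a common rule of thumb rather than a formal test.

```python
import statistics
from collections import Counter

# Hypothetical data set (illustrative values only).
data = [12, 15, 11, 14, 13, 15, 16, 12, 14, 15, 42, 13, 14, 15, 13]

q1, q2, q3 = statistics.quantiles(data, n=4)  # quartiles
print(f"min={min(data)} Q1={q1} median={q2} Q3={q3} max={max(data)}")

# Crude text histogram: one '#' per observation at each value.
for value, count in sorted(Counter(data).items()):
    print(f"{value:>3} | {'#' * count}")

# Flag points far outside the interquartile range.
iqr = q3 - q1
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]
print(f"possible outliers: {outliers}")
```

Here the value 42 stands out both in the histogram and under the interquartile-range rule, which is the kind of feature EDA is meant to surface before any formal modeling.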
Misuse
Misuse of statistics can produce subtle but serious errors in description and interpretation—subtle in the sense that even experienced professionals make such errors, and serious in the sense that they can lead to devastating decision errors. For instance, social policy, medical practice, and the reliability of structures like bridges all rely on the proper use of statistics.
Even when statistical techniques are correctly applied, the results can be difficult to interpret for those lacking expertise. The statistical significance of a trend in the data—which measures the extent to which a trend could be caused by random variation in the sample—may or may not agree with an intuitive sense of its significance. The set of basic statistical skills (and skepticism) that people need to deal with information in their everyday lives properly is referred to as statistical literacy.
There is a general perception that statistical knowledge is all-too-frequently intentionally misused by finding ways to interpret only the data that are favorable to the presenter.[55] A mistrust and misunderstanding of statistics is associated with the quotation, "There are three kinds of lies: lies, damned lies, and statistics". Misuse of statistics can be both inadvertent and intentional, and the book How to Lie with Statistics,[55] by Darrell Huff, outlines a range of considerations. In an attempt to shed light on the use and misuse of statistics, reviews of statistical techniques used in particular fields are conducted (e.g. Warne, Lazo, Ramos, and Ritter (2012)).[56]
Ways to avoid misuse of statistics include using proper diagrams and avoiding
To assist in the understanding of statistics Huff proposed a series of questions to be asked in each case:[55]
- Who says so? (Does he/she have an axe to grind?)
- How does he/she know? (Does he/she have the resources to know the facts?)
- What's missing? (Does he/she give us a complete picture?)
- Did someone change the subject? (Does he/she offer us the right answer to the wrong problem?)
- Does it make sense? (Is his/her conclusion logical and consistent with what we already know?)
Misinterpretation: correlation
The concept of
Applications
Applied statistics, theoretical statistics and mathematical statistics
Applied statistics, sometimes referred to as Statistical science,[61] comprises descriptive statistics and the application of inferential statistics.[62][63] Theoretical statistics concerns the logical arguments underlying justification of approaches to statistical inference, as well as encompassing mathematical statistics. Mathematical statistics includes not only the manipulation of probability distributions necessary for deriving results related to methods of estimation and inference, but also various aspects of computational statistics and the design of experiments.
Machine learning and data mining
Machine learning models are statistical and probabilistic models that capture patterns in the data through use of computational algorithms.
Statistics in academia
Statistics is applicable to a wide variety of
A typical statistics course covers descriptive statistics, probability, binomial and
Statistical computing
The rapid and sustained increases in computing power starting from the second half of the 20th century have had a substantial impact on the practice of statistical science. Early statistical models were almost always from the class of
Increased computing power has also led to the growing popularity of computationally intensive methods based on
Business statistics
In business, "statistics" is a widely used
A typical "Business Statistics" course is intended for
Statistics applied to mathematics or the arts
Traditionally, statistics was concerned with drawing inferences using a semi-standardized methodology that was "required learning" in most sciences. This tradition has changed with the use of statistics in non-inferential contexts. What was once considered a dry subject, taken in many fields as a degree-requirement, is now viewed enthusiastically. Initially derided by some mathematical purists, it is now considered essential methodology in certain areas.
- In number theory, scatter plots of data generated by a distribution function may be transformed with familiar tools used in statistics to reveal underlying patterns, which may then lead to hypotheses.
- Predictive methods of statistics in fractal geometry can be used to create video works.[70]
- The process art of Jackson Pollock relied on artistic experiments whereby underlying distributions in nature were artistically revealed.[71] With the advent of computers, statistical methods were applied to formalize such distribution-driven natural processes to make and analyze moving video art.
- Methods of statistics may be used predictively in a Markov process that only works some of the time, the occasion of which can be predicted using statistical methodology.
- Statistics can be used to predictively create art, as in the statistical or stochastic music invented by Iannis Xenakis, where the music is performance-specific. Though this type of artistry does not always come out as expected, it does behave in ways that are predictable and tunable using statistics.
Specialized disciplines
Statistical techniques are used in a wide range of types of scientific and social research, including:
- Actuarial science (assesses risk in the insurance and finance industries)
- Applied information economics
- Astrostatistics (statistical evaluation of astronomical data)
- Biostatistics
- Chemometrics (for analysis of data from chemistry)
- Data mining (applying statistics and pattern recognition to discover knowledge from data)
- Data science
- Demography (statistical study of populations)
- Econometrics (statistical analysis of economic data)
- Energy statistics
- Engineering statistics
- Epidemiology (statistical analysis of disease)
- Geography and geographic information systems, specifically in spatial analysis
- Image processing
- Jurimetrics (law)
- Medical statistics
- Political science
- Psychological statistics
- Reliability engineering
- Social statistics
- Statistical mechanics
In addition, there are particular types of statistical analysis that have also developed their own specialised terminology and methodology:
- Bootstrap / jackknife resampling
- Multivariate statistics
- Statistical classification
- Structured data analysis
- Structural equation modelling
- Survey methodology
- Survival analysis
- Statistics in various sports, particularly baseball – known as sabermetrics – and cricket
Statistics is a key tool in business and manufacturing as well. It is used to understand the variability of measurement systems, to control processes (as in statistical process control or SPC), to summarize data, and to make data-driven decisions.
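Statistical process control can be sketched with the conventional three-sigma control limits. The measurements below are hypothetical, and the standard deviation is estimated directly from the baseline for simplicity; practical control charts often estimate it from subgroup or moving ranges instead.

```python
import statistics

# Hypothetical in-control baseline measurements.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 10.0]
center = statistics.mean(baseline)
sigma = statistics.stdev(baseline)
lower, upper = center - 3 * sigma, center + 3 * sigma

# New measurements are checked against the limits.
new_points = [10.1, 9.95, 10.9, 10.0]
out_of_control = [x for x in new_points if not lower <= x <= upper]

print(f"control limits: ({lower:.3f}, {upper:.3f})")
print(f"out-of-control points: {out_of_control}")
```

A point outside the limits signals that the process may have shifted and warrants investigation; it is not by itself proof of a fault.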
See also
- Abundance estimation
- Glossary of probability and statistics
- List of academic statistical associations
- List of important publications in statistics
- List of national and international statistical services
- List of statistical software
- List of statistics articles
- List of university statistical consulting centers
- Notation in probability and statistics
- Statistics education
- World Statistics Day
- Robust statistics
- Foundations and major areas of statistics
- Philosophy of statistics
- Probability interpretations
- Foundations of statistics
- List of statisticians
- Official statistics
- Multivariate analysis of variance
References
- "statistics". Oxford English Dictionary (Online ed.). Oxford University Press. (Subscription or participating institution membership required.)
- "Statistik". Digitales Wörterbuch der deutschen Sprache (in German). Berlin-Brandenburgischen Akademie der Wissenschaften. August 2024.
- ISBN 978-0-19-954145-4.. Cambridge Dictionary.
- Romijn, Jan-Willem (2014). "Philosophy of statistics". Stanford Encyclopedia of Philosophy.
- "Statistics"
The dependability of a sample can be destroyed by [bias]... allow yourself some degree of skepticism.
- Sharpe, N. (2014). Business Statistics, Pearson. ISBN 978-0134705217
- Wegner, T. (2010). Applied Business Statistics: Methods and Excel-Based Applications, Juta Academic. ISBN 0702172863
- Holmes, L., Illowsky, B., Dean, S. (2017). Introductory Business Statistics Archived 2021-06-16 at the Wayback Machine
- Nica, M. (2013). Principles of Business Statistics Archived 2021-05-18 at the Wayback Machine
Further reading
- Lydia Denworth, "A Significant Problem: Standard scientific methods are under fire. Will anything change?", p values for nearly a century [since 1925] to determine statistical significance of experimental results has contributed to an illusion of certainty and [to] reproducibility crises in many scientific fields. There is growing determination to reform statistical analysis... Some [researchers] suggest changing statistical methods, whereas others would do away with a threshold for defining "significant" results." (p. 63.)
- Barbara Illowsky; Susan Dean (2014). Introductory Statistics. OpenStax CNX. ISBN 978-1938168208.
- Stockburger, David W. "Introductory Statistics: Concepts, Models, and Applications". Missouri State University (3rd Web ed.). Archived from the original on 28 May 2020.
- OpenIntro Statistics Archived 2019-06-16 at the Wayback Machine, 3rd edition by Diez, Barr, and Cetinkaya-Rundel
- Stephen Jones, 2010. Statistics in Psychology: Explanations without Equations. Palgrave Macmillan. ISBN 978-1137282392.
- Cohen, J (1990). "Things I have learned (so far)" (PDF). American Psychologist. 45 (12): 1304–1312. S2CID 7180431. Archived from the original (PDF) on 2017-10-18.
- Gigerenzer, G (2004). "Mindless statistics". Journal of Socio-Economics. 33 (5): 587–606.
- Ioannidis, J.P.A. (2005). "Why most published research findings are false". PLOS Medicine. 2 (4): 696–701. PMID 17456002.
External links
- (Electronic Version): TIBCO Software Inc. (2020). Data Science Textbook.
- Online Statistics Education: An Interactive Multimedia Course of Study. Developed by Rice University (Lead Developer), University of Houston Clear Lake, Tufts University, and National Science Foundation.
- UCLA Statistical Computing Resources (archived 17 July 2006)
- Philosophy of Statistics from the Stanford Encyclopedia of Philosophy