Data dredging
Data dredging (also known as data snooping or p-hacking)[a] is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives.
Data dredging involves testing multiple hypotheses using a single data set by exhaustively searching, perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.
Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the null hypothesis). When large numbers of tests are performed, some inevitably produce false results of this type: by chance alone, about 5% of randomly chosen true null hypotheses will be erroneously reported as statistically significant at the 5% significance level, about 1% at the 1% level, and so on.
Data dredging is an example of disregarding the multiple comparisons problem. One form is when subgroups are compared without alerting the reader to the total number of subgroup comparisons examined.[6]
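The danger of unreported multiple comparisons can be quantified with a short calculation. The sketch below is a generic illustration (not from any cited study), assuming independent tests:

```python
# Illustration: even when every null hypothesis is true, the chance of
# at least one false positive grows quickly with the number of
# independent tests performed at significance level alpha.

def family_wise_error_rate(alpha: float, m: int) -> float:
    """Probability of at least one false positive among m independent
    tests, each performed at significance level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 10, 20, 100):
    print(m, round(family_wise_error_rate(0.05, m), 3))
```

With alpha = 0.05, twenty unreported subgroup comparisons already carry a roughly 64% chance of at least one spurious "significant" finding.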
Types
Drawing conclusions from data
The conventional statistical hypothesis testing procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, and finally carry out a statistical significance test to see how likely such results would be if chance alone were at work.
A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns.
For example, flipping a coin five times and getting 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless: the proper procedure would have been to form the hypothesis in advance and then test it on a new set of coin tosses.
Hypothesis suggested by non-representative data
Suppose that a study of a random sample of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data snooping might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities, each having a low probability of being true, an unusual one can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data snooping, could then be that people born on August 7 have a much higher chance of switching minors more than twice in college.
The data itself, taken out of context, might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, the result will most likely not be reproducible: any attempt to check whether other people born on August 7 show a similar rate of changing minors will almost certainly produce contradictory results.
Systematic bias
Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple comparison of abacavir against the other treatments. An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were at higher risk and so more of them had heart attacks.[6] This problem can be very severe in observational studies.[6][2]
Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.[6]
Multiple modelling
Another aspect of the conditioning of statistical tests by knowledge of the data arises in multiple modelling: deciding, on the basis of the data, which covariates to include in a relationship explaining one or more other variables. Variables retained in the final model have had to pass some kind of preliminary test that the discarded variables failed, so standard statistical procedures cannot validly be applied to the retained variables as though no selection had occurred. Selvin and Stuart compared the retained variables to the fish that do not fall through the net, in the sense that their apparent effects are bound to be bigger than those of the variables that do.
Examples
In meteorology and epidemiology
In meteorology, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that future data cannot, even subconsciously, influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process makes it impossible to accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.
As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for hundreds or thousands of different, mostly uncorrelated variables. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one of them correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, confirming it requires a further test using the same variables but with data from a different location. Note that a p-value of 0.01 means that 1% of the time a result at least that extreme would be obtained by chance alone; if hundreds or thousands of hypotheses are tested, one is likely to obtain p-values below 0.01 for many null hypotheses.
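This effect of searching across many uncorrelated variables is easy to simulate. In the sketch below all numbers (200 areas, 1,000 variables) and names are invented, and everything is pure noise, yet some "significant" correlations with the simulated cancer rate appear anyway:

```python
import math
import random
from statistics import NormalDist

random.seed(42)
n_areas, n_variables = 200, 1000  # illustrative sizes, not from the text

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# A "cancer rate" that is pure noise, and 1,000 demographic variables
# that are pure noise as well: nothing here is truly correlated.
cancer_rate = [random.gauss(0, 1) for _ in range(n_areas)]

significant = 0
for _ in range(n_variables):
    variable = [random.gauss(0, 1) for _ in range(n_areas)]
    r = pearson_r(cancer_rate, variable)
    # Fisher transform: under independence, atanh(r) * sqrt(n - 3) ~ N(0, 1)
    z = math.atanh(r) * math.sqrt(n_areas - 3)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    significant += p < 0.01

print(significant)  # typically about 10 of the 1,000 null variables
```

About 1% of the null variables clear the p &lt; 0.01 bar, exactly as the significance level predicts; dredging simply ensures those few are the ones that get reported.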
In sociology
Another way to flatten a p-curve is to selectively control for variables such as gender, including a control only when doing so pushes a result across the significance threshold.
Appearance in media
One example is the chocolate weight-loss hoax study conducted by journalist John Bohannon, who explained publicly in a Gizmodo article that the study had been deliberately conducted fraudulently as a social experiment.[10] The study was widely reported by media outlets around 2015, and many people believed the claim that eating a chocolate bar every day would help them lose weight. The study was released under the name of the "Institute of Diet and Health," an institute Bohannon invented for the hoax. According to Bohannon, the key to obtaining a p-value below 0.05 was measuring 18 different variables, so that at least one comparison was likely to come out "significant" by chance.
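Eighteen measured outcomes can be fed into the multiple-comparisons arithmetic. Assuming, unrealistically, that the outcomes are independent, a back-of-envelope figure:

```python
# Chance that at least one of 18 independent outcomes comes out
# "significant" at alpha = 0.05 by luck alone (rough figure: real
# outcomes are correlated, so this is only an approximation).
p_at_least_one = 1 - (1 - 0.05) ** 18
print(round(p_at_least_one, 3))  # → 0.603
```

In other words, a study measuring that many outcomes is more likely than not to produce a headline-ready "significant" result even if chocolate does nothing at all.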
Remedies
While looking for patterns in data is legitimate, applying a statistical test of significance (hypothesis test) to the same data from which the pattern was learned is wrong. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests: the researcher collects a data set and randomly partitions it into two subsets, examines only the first subset to create hypotheses, and then tests those hypotheses on the second subset, which played no part in forming them.
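A randomized split-sample design of this kind is straightforward to implement. A minimal sketch, with an invented function name and an illustrative 50/50 split:

```python
import random

def split_sample(records, seed=0):
    """Partition records at random into an exploration half (used to
    generate hypotheses) and a confirmation half (used to test them)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]

explore, confirm = split_sample(range(100))
print(len(explore), len(confirm))  # → 50 50
```

Any pattern found in `explore` counts only as a hypothesis; it earns statistical support only if it reappears in `confirm`, which the hypothesis-generation step never saw.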
Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance (alpha) by this number; this is the Bonferroni correction. However, it is a very conservative threshold: a family-wise alpha of 0.05, divided in this way to account for 1,000 significance tests, yields a stringent per-hypothesis alpha of 0.00005. Methods that control the false discovery rate are less conservative alternatives when many tests are performed.
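The division of alpha by the number of tests is the Bonferroni correction; a minimal sketch with illustrative numbers:

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold for a family-wise level alpha."""
    return alpha / n_tests

# 1,000 recorded significance tests at a family-wise alpha of 0.05
# leave a per-test threshold of 0.00005:
print(bonferroni_threshold(0.05, 1000))

# Only p-values below the corrected threshold remain significant.
p_values = [0.00001, 0.004, 0.03]
print([p for p in p_values if p < bonferroni_threshold(0.05, 3)])  # → [1e-05, 0.004]
```

Note that 0.03 would have passed an uncorrected 0.05 cutoff but fails once the three tests are accounted for.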
When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.
Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of the data and of the method used to examine the data. Thus, if someone says that a certain event has a probability of 20% ± 2%, 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result falls between 18% and 22% with probability 0.95. No claim of statistical significance can be made by looking at the data alone, without due regard to the method used to assess it.
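The "19 times out of 20" reading can be checked by simulation: a correctly applied 95% confidence procedure should cover the true value in about 95% of repetitions. All numbers below (true probability, sample size, trial count) are invented for illustration:

```python
import math
import random

random.seed(1)
true_p, n, trials = 0.20, 1500, 1000  # illustrative numbers

covered = 0
for _ in range(trials):
    hits = sum(random.random() < true_p for _ in range(n))
    estimate = hits / n
    # Normal-approximation 95% interval: estimate ± 1.96 * standard error
    half_width = 1.96 * math.sqrt(estimate * (1 - estimate) / n)
    covered += abs(estimate - true_p) <= half_width

print(covered / trials)  # close to 0.95
```

With n = 1500 the half-width is about 0.02, matching the "20% ± 2%" example: the interval, not any single estimate, is what covers the truth 19 times out of 20.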
Academic journals increasingly shift to the registered report format, in which the research question and analysis plan are peer-reviewed before the data are collected, and the paper is accepted or rejected regardless of how the results turn out. This removes the incentive to dredge the data in search of publishable p-values and is intended to counteract practices such as data dredging and HARKing.
Methods and results can also be made publicly available, as in the open science approach, making it yet more difficult for data dredging to take place.[14]
See also
- Aliasing – Signal processing effect
- Base rate fallacy – Error in thinking which involves under-valuing base rate information
- Bible code – Purported set of secret messages encoded within the Hebrew text of the Torah
- Bonferroni inequalities – Inequality applying to probability spaces
- Cherry picking – Fallacy of incomplete evidence
- Garden of forking paths fallacy[15] – Side effect of too many researcher degrees of freedom
- Circular analysis – Error in statistical analysis
- HARKing – Acronym for "Hypothesizing after the results are known"
- Lincoln–Kennedy coincidences urban legend – Urban legend
- Look-elsewhere effect – Statistical analysis phenomenon
- Metascience – Scientific study of science
- Misuse of statistics – Use of statistical arguments to assert falsehoods
- Overfitting – Flaw in mathematical modelling
- Pareidolia – Perception of meaningful patterns or images in random or vague stimuli
- Post hoc analysis – Statistical analyses that were not specified before the data were seen
- Post hoc theorizing – Testing statistical hypotheses on the same data set that suggested them, which makes acceptance likely even when they are not true, due to circular reasoning
- Predictive analytics – Statistical techniques analyzing facts to make predictions about unknown events
- Texas sharpshooter fallacy – Statistical fallacy
Notes
- ^ Other names are data butchery, data fishing, selective inference, significance chasing, and significance questing.
References
- ISSN 0003-1305.
- ^ a b PMID 12493654.
- ISSN 0028-792X. Retrieved 2023-10-01.
- Wall Street Journal. Archived from the original on 2023-09-24. Retrieved 2023-10-08.
- ^ "APA PsycNet". psycnet.apa.org. Retrieved 2023-10-08.
- ^ Selvin, H. C.; Stuart, A. (1966). "Data-Dredging Procedures in Survey Analysis". The American Statistician. 20 (3): 20–23. JSTOR 2681493.
- ^ a b Berk, R.; Brown, L.; Zhao, L. (2009). "Statistical Inference After Model Selection". J Quant Criminol. 26 (2): 217–236. S2CID 10350955.
- PMID 30856227.
- ^ Bohannon, John (2015-05-27). "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How". Gizmodo. Retrieved 2023-10-20.
- ^ Taylor, J.; Tibshirani, R. (2015). "Statistical learning and selective inference". Proceedings of the National Academy of Sciences. 112 (25): 7629–7634. PMC 4485109.
- S2CID 28976450.
- ^ "Streamlined review and registered reports soon to be official at EJP". ejp-blog.com. 6 February 2018.
- ^ Vyse, Stuart (2017). "P-Hacker Confessions: Daryl Bem and Me". Skeptical Inquirer. 41 (5): 25–27. Archived from the original on 2018-08-05. Retrieved 5 August 2018.
- ^ Gelman, Andrew (2013). "The garden of forking paths" (PDF).
Further reading
- PMID 16060722.
- Head, Megan L.; Holman, Luke; Lanfear, Rob; Kahn, Andrew T.; Jennions, Michael D. (13 March 2015). "The Extent and Consequences of P-Hacking in Science". PLOS Biology. 13 (3): e1002106. PMID 25768323.
- Insel, Thomas (November 14, 2014). "P-Hacking". NIMH Director's Blog.
- Smith, Gary (2016). Standard Deviations: Flawed Assumptions, Tortured Data, and Other Ways to Lie with Statistics. Gerald Duckworth & Co. ISBN 9780715649749.
External links
- A bibliography on data-snooping bias
- Spurious Correlations, a gallery of examples of implausible correlations
- StatQuest: P-value pitfalls and power calculations on YouTube
- Video explaining p-hacking by "Neuroskeptic", a blogger at Discover Magazine
- Step Away From Stepwise, an article in the Journal of Big Data criticizing stepwise regression