Data dredging

Source: Wikipedia, the free encyclopedia.

Data dredging (also known as data snooping or p-hacking) is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results.[2]

The process of data dredging involves testing multiple hypotheses using a single data set by exhaustively searching—perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.

Conventional tests of statistical significance are based on the probability that a particular result could arise by chance alone, and necessarily accept some risk of mistaken rejections of the null hypothesis. When enough hypotheses are tested on the same data set, it becomes virtually certain that some will appear statistically significant by chance, since almost every data set with any degree of randomness is likely to contain some spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these results. The term p-hacking (in reference to p-values) was coined in a 2014 paper by the three researchers behind the blog Data Colada, which focuses on uncovering such problems in social science research.[3][4][5]

Data dredging is an example of disregarding the multiple comparisons problem. One form is when subgroups are compared without alerting the reader to the total number of subgroup comparisons examined.[6]
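The scale of the problem can be illustrated with a minimal Python sketch (the assumption that the comparisons are independent is a simplification made only for this illustration): with m comparisons each tested at the 0.05 level, the chance of at least one spurious "significant" finding grows quickly with m.

    alpha = 0.05  # per-comparison significance level
    for m in (1, 5, 10, 20, 100):
        # Each truly null comparison has a 1 - alpha chance of coming back
        # non-significant; the chance that at least one of m independent
        # comparisons comes back "significant" anyway is 1 - (1 - alpha)**m.
        print(f"m = {m:3d}: P(at least one spurious finding) = {1 - (1 - alpha) ** m:.2f}")

With 20 subgroup comparisons the chance of at least one spurious "significant" difference already approaches two in three.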

Types

Drawing conclusions from data

The conventional statistical hypothesis-testing procedure is to formulate a research hypothesis in advance, collect relevant data, and then carry out a statistical significance test to see how likely the results are by chance alone (also called testing against the null hypothesis).

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns.

For example, flipping a coin five times with a result of 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then toss the coin various times to see if the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. It is important to realize that the statistical significance under the incorrect procedure is completely spurious; significance tests do not protect against data dredging.
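A minimal simulation of this coin example (the seed and code are illustrative, not from the article) shows why "confirming" the hypothesis on the flips that suggested it is vacuous, while a fresh set of flips provides a genuine check:

    import random

    random.seed(0)
    flips = [random.random() < 0.5 for _ in range(5)]   # a fair coin; True = tails
    p_suggested = sum(flips) / 5                         # hypothesis read off the data

    # "Testing" on the same five flips agrees with the hypothesis by construction.
    same_data_rate = sum(flips) / 5
    # A new, independent set of tosses is the only legitimate test.
    fresh = [random.random() < 0.5 for _ in range(5)]
    fresh_rate = sum(fresh) / 5

    print(f"hypothesized P(tails): {p_suggested}")
    print(f"rate on the same flips: {same_data_rate} (always matches, proves nothing)")
    print(f"rate on fresh flips:    {fresh_rate} (independent evidence)")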

Hypothesis suggested by non-representative data

Suppose that a study of a random sample of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data dredging might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data dredging, could then be "people born on August 7 have a much higher chance of switching minors more than twice in college."

The data itself taken out of context might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be reproducible; any attempt to check if others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.

Systematic bias

Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple comparison of abacavir with other treatments. An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were at higher risk, so more of them had heart attacks.[6] This problem can be very severe, for example, in observational studies.[6][2]
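The pattern can be sketched with a small simulation using made-up numbers (the risk levels and assignment probabilities below are hypothetical, not taken from the abacavir studies): both drugs are identical by construction, yet the naive comparison makes drug A look worse, while stratifying by risk group removes the difference.

    import random

    random.seed(1)

    def heart_attack(high_risk: bool) -> int:
        """Outcome depends only on baseline risk; the drug has no effect here."""
        return 1 if random.random() < (0.20 if high_risk else 0.05) else 0

    drug_a, drug_b = [], []
    for _ in range(10_000):
        high_risk = random.random() < 0.5
        # The confounding step: doctors channel high-risk patients toward drug A.
        gets_a = random.random() < (0.8 if high_risk else 0.2)
        record = (high_risk, heart_attack(high_risk))
        (drug_a if gets_a else drug_b).append(record)

    def event_rate(rows):
        return sum(outcome for _, outcome in rows) / len(rows)

    print(f"naive event rate, drug A: {event_rate(drug_a):.3f}")  # looks worse
    print(f"naive event rate, drug B: {event_rate(drug_b):.3f}")
    for risk in (True, False):
        a = event_rate([r for r in drug_a if r[0] == risk])
        b = event_rate([r for r in drug_b if r[0] == risk])
        print(f"within high_risk={risk}: drug A {a:.3f} vs drug B {b:.3f}")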

Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.[6] By selecting papers with significant p-values, negative studies are selected against, which is publication bias. This is also known as file drawer bias, because less significant p-value results are left in the file drawer and never published.
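A small sketch of the file drawer effect (the sample sizes and threshold are arbitrary choices for illustration): every simulated study measures a true effect of zero, but if only "significant" results reach publication, the published literature reports a spurious nonzero effect.

    import random
    import statistics

    random.seed(2)
    published = []
    for _ in range(5000):
        sample = [random.gauss(0.0, 1.0) for _ in range(30)]  # true effect is exactly 0
        mean = statistics.mean(sample)
        sem = statistics.stdev(sample) / len(sample) ** 0.5
        if abs(mean / sem) > 2:                               # roughly p < 0.05, two-sided
            published.append(abs(mean))                       # only these leave the file drawer

    print(f"{len(published)} of 5000 null studies were 'significant' and published")
    print(f"average published effect size: {statistics.mean(published):.2f}")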

Multiple modelling

Another aspect of the conditioning of statistical tests by knowledge of the data arises in multiple modelling, where a crucial step is deciding which covariates to include in a relationship explaining one or more other variables. Both statistical considerations (such as stepwise regression) and substantive ones lead researchers to favour some models over others, and significance tests are used liberally along the way. However, discarding one or more variables from an explanatory relation on the basis of the data means that standard statistical procedures cannot validly be applied to the retained variables as though nothing had happened: by construction, the retained variables have had to pass some kind of preliminary test that the discarded variables failed.[7][8]
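The consequence can be sketched with a short simulation (illustrative only; it screens candidates by plain correlation rather than full stepwise regression, and assumes NumPy and SciPy are available): when the single most promising of thirty noise regressors is retained and then tested naively, it comes out "significant" far more often than the nominal 5%.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(7)
    n, n_candidates, n_sims = 50, 30, 1000
    false_positives = 0
    for _ in range(n_sims):
        X = rng.normal(size=(n_candidates, n))  # 30 candidate regressors, all pure noise
        y = rng.normal(size=n)                   # outcome, also pure noise
        # Selection step: keep the candidate most correlated with y.
        best = max(X, key=lambda x: abs(pearsonr(x, y)[0]))
        # Naive test on the retained variable, ignoring the selection it survived.
        if pearsonr(best, y)[1] < 0.05:
            false_positives += 1
    print(f"retained variable 'significant' in {false_positives / n_sims:.0%} of null datasets")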

Examples

In meteorology and epidemiology

In meteorology, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that, even subconsciously, future data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cluster of cancer cases, but lack a firm hypothesis as to why. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm it. Note that a p-value of 0.01 means that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is likely to obtain a p-value less than 0.01 for many null hypotheses.
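A short simulation along these lines (the counts and seed are arbitrary; NumPy and SciPy are assumed available) shows how many "discoveries" pure noise produces at the 0.01 level when a thousand unrelated variables are screened:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(42)
    n_areas, n_vars = 200, 1000
    cancer_rate = rng.normal(size=n_areas)           # outcome: pure noise
    candidates = rng.normal(size=(n_vars, n_areas))  # 1000 unrelated demographic variables

    false_hits = sum(pearsonr(v, cancer_rate)[1] < 0.01 for v in candidates)
    print(f"{false_hits} of {n_vars} unrelated variables reach p < 0.01")
    # Roughly 1% of 1000, i.e. about ten spurious "findings", is expected from chance alone.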

In sociology

Another way to flatten a p-curve is to control for gender. t-values are inversely related to p-values, meaning higher t-values (above 2.8) indicate lower p-values. By selectively controlling for gender, one can artificially inflate the t-value, thus artificially deflating the p-value as well.
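A rough simulation of this kind of specification searching (hypothetical data; the regressions use the statsmodels library): if the analyst may report the pooled model, the model with a gender control, or either single-gender subsample, whichever yields the smallest p-value, the false positive rate for a truly null effect rises well above the nominal 5%.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, n_sims, alpha = 100, 2000, 0.05
    hits = 0
    for _ in range(n_sims):
        x = rng.normal(size=n)               # predictor of interest; true effect is zero
        gender = rng.integers(0, 2, size=n)  # candidate control / grouping variable
        y = rng.normal(size=n)               # outcome: pure noise
        pvals = [
            sm.OLS(y, sm.add_constant(x)).fit().pvalues[1],                              # pooled
            sm.OLS(y, sm.add_constant(np.column_stack([x, gender]))).fit().pvalues[1],   # controlled
        ]
        for g in (0, 1):                     # each gender subsample analysed separately
            mask = gender == g
            pvals.append(sm.OLS(y[mask], sm.add_constant(x[mask])).fit().pvalues[1])
        hits += min(pvals) < alpha           # report whichever specification "worked"
    print(f"false positive rate: {hits / n_sims:.3f} (nominal level {alpha})")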

Appearance in media

One example is the chocolate weight-loss hoax study conducted by journalist John Bohannon, who explained publicly in a Gizmodo article that the study had been deliberately conducted fraudulently as a social experiment.[10] The study was widely reported by media outlets around 2015, and many people believed the claim that eating a chocolate bar every day would help them lose weight, against their better judgement. The study was published under the name of the "Institute of Diet and Health". According to Bohannon, the key to pushing the p-value below 0.05 was measuring 18 different variables in the trial.
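If those 18 measured outcomes were roughly independent, the chance that at least one of them would cross the 0.05 threshold by luck alone is about 1 - 0.95^18, roughly 60%, which is essentially the mechanism the hoax exploited.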

Remedies

While looking for patterns in data is legitimate, applying a statistical test of significance (hypothesis testing) to the same data from which the pattern was learned is wrong. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset—say, subset A—is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid. (This is a simple type of cross-validation
and is often termed training-test or split-half validation.)
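A minimal sketch of this split-half procedure (the variable counts are illustrative, and NumPy and SciPy are assumed available): the most promising variable is picked on subset A only, and its single confirmatory test is run on the held-out subset B.

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(3)
    n, n_vars = 400, 200
    X = rng.normal(size=(n, n_vars))    # candidate predictors (pure noise in this sketch)
    y = rng.normal(size=n)              # outcome

    idx = rng.permutation(n)
    A, B = idx[: n // 2], idx[n // 2:]  # random partition into exploration and test halves

    # Exploration on subset A only: pick the variable that looks most promising.
    best = max(range(n_vars), key=lambda j: abs(pearsonr(X[A, j], y[A])[0]))
    p_explore = pearsonr(X[A, best], y[A])[1]  # impressive, but cherry-picked
    p_confirm = pearsonr(X[B, best], y[B])[1]  # the only p-value that should be reported

    print(f"variable {best}: p = {p_explore:.4f} on subset A, p = {p_confirm:.4f} on subset B")

Because the data here are pure noise, the confirmatory p-value on B is usually unremarkable, exposing the exploratory "finding" as spurious.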

Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance (alpha) by this number; this is the Bonferroni correction. Related family-wise procedures, such as the Tukey method for pairwise comparisons, serve a similar purpose. However, these corrections can be extremely conservative. To avoid the extreme conservativeness of the Bonferroni correction, more sophisticated selective inference methods are available.[11] The most common selective inference method is the use of Benjamini and Hochberg's false discovery rate controlling procedure: it is a less conservative approach that has become a popular method for control of multiple hypothesis tests.
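The two corrections can be compared on a toy list of p-values; the sketch below (with made-up p-values) applies the Bonferroni cutoff and the Benjamini-Hochberg step-up procedure.

    def bonferroni(pvals, alpha=0.05):
        """Flag tests whose p-value clears alpha divided by the number of tests."""
        m = len(pvals)
        return [p <= alpha / m for p in pvals]

    def benjamini_hochberg(pvals, alpha=0.05):
        """BH step-up: find the largest k with p_(k) <= (k/m)*alpha; flag the k smallest p-values."""
        m = len(pvals)
        order = sorted(range(m), key=lambda i: pvals[i])
        k_max = 0
        for rank, i in enumerate(order, start=1):
            if pvals[i] <= rank / m * alpha:
                k_max = rank
        flags = [False] * m
        for i in order[:k_max]:
            flags[i] = True
        return flags

    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.360]
    print(sum(bonferroni(pvals)), "discoveries under Bonferroni")           # 1
    print(sum(benjamini_hochberg(pvals)), "discoveries under FDR control")  # 2

The false discovery rate procedure admits more discoveries because it controls the expected proportion of false discoveries rather than the probability of making any false discovery at all.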

When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.[8]

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated by the same method used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.

Academic journals are increasingly shifting to the registered report format, which aims to counteract very serious issues such as data dredging and HARKing (hypothesizing after the results are known), which have made theory-testing research very unreliable. For example, Nature Human Behaviour has adopted the registered report format, as it "shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them".[12] The European Journal of Personality defines this format as follows: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes."[13]

Methods and results can also be made publicly available, as in the open science approach, making it yet more difficult for data dredging to take place.[14]

Notes

  1. Other names are data butchery, data fishing, selective inference, significance chasing, and significance questing.

References

  1. ISSN 0003-1305.
  2. PMID 12493654.
  3. Retrieved 2023-10-01.
  4. Archived from the original on 2023-09-24. Retrieved 2023-10-08.
  5. "APA PsycNet". psycnet.apa.org. Retrieved 2023-10-08.
  6. .
  7. Selvin, H. C.; Stuart, A. (1966). "Data-Dredging Procedures in Survey Analysis". The American Statistician. 20 (3): 20–23. JSTOR 2681493.
  8. Berk, R.; Brown, L.; Zhao, L. (2009). "Statistical Inference After Model Selection". J Quant Criminol. 26 (2): 217–236. S2CID 10350955.
  9. .
  10. Bohannon, John (2015-05-27). "I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here's How". Gizmodo. Retrieved 2023-10-20.
  11. Taylor, J.; Tibshirani, R. (2015). "Statistical learning and selective inference". Proceedings of the National Academy of Sciences. 112 (25): 7629–7634. PMC 4485109.
  12. .
  13. "Streamlined review and registered reports soon to be official at EJP". ejp-blog.com. 6 February 2018.
  14. Vyse, Stuart (2017). "P-Hacker Confessions: Daryl Bem and Me". Skeptical Inquirer. 41 (5): 25–27. Archived from the original on 2018-08-05. Retrieved 5 August 2018.
  15. Gelman, Andrew (2013). "The garden of forking paths" (PDF).
