Wikipedia:Reliability of open government data

Source: Wikipedia, the free encyclopedia.

Wikipedia fundamentally relies on the use of what we call reliable sources. We are starting to use more and more open data from government sources, as illustrated in the COVID-19 pandemic. But shouldn't we clearly distinguish between "reliable" data and "official" data? When can government agencies be trusted to provide reliable data? COVID-19 pandemic daily infection counts lack credibility for several countries around the world:[1][2] how should Wikipedia readers be warned?

Sep 2021: constructive editing of this essay is welcome, but it is not intended as a support/oppose survey. Please edit or insert arguments and counterarguments, preferably with sources, into prose and/or lists. Individual sections on the talk page could be used for support/oppose type discussions, with summaries later being inserted into the essay itself.

The COVID-19 pandemic case

During the COVID-19 pandemic that dominated world news starting in 2020, some of the key pieces of knowledge that readers have sought and editors have provided are the daily counts of how many people have been infected or have died in countries around the world. Numerous media sources have raised specific concerns about the data from several countries, and Wikipedia editing generally follows the usual pattern of judging the reliability of particular media sources, doctors' statements and citizens' groups' statements, rather than relying on government agencies' statements alone. However, the key diagrams, and the numbers that feed through to global totals on the pandemic, carry no indication that some of the data is unreliable.

The worse a country's Press Freedom Index is, the more likely that country is to lack day-to-day random fluctuations (stochastic noise) in its official COVID-19 daily infection counts. Presumably, government agencies with less risk of press criticism are less worried about fabricating their official open data.[1]
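To illustrate what "lacking stochastic noise" could mean in practice, here is a minimal sketch. It is not the method of the cited analysis;[1] it only assumes that genuine daily counts should scatter around their local trend at least as much as Poisson counting noise would, and it compares the observed scatter with that expectation. The function name `dispersion_ratio` and the 7-day window are hypothetical choices made for this illustration.

```python
from statistics import mean


def dispersion_ratio(counts, window=7):
    """Ratio of observed scatter to Poisson-expected scatter in daily counts.

    For each day that has a full centred window, the locally expected count
    is taken to be the mean of the neighbouring days (the day itself is
    excluded). For Poisson noise, (count - expected)**2 / expected averages
    to a value of order 1; values far below 1 indicate a series that is
    smoother than counting noise alone would allow.
    """
    half = window // 2
    terms = []
    for i in range(half, len(counts) - half):
        neighbours = counts[i - half:i] + counts[i + 1:i + half + 1]
        expected = mean(neighbours)
        if expected > 0:
            terms.append((counts[i] - expected) ** 2 / expected)
    if not terms:
        raise ValueError("series too short for the chosen window")
    return mean(terms)


# Invented example series (not real data): a perfectly smooth trend and a
# visibly fluctuating one.
smooth = [100, 102, 104, 106, 108, 110, 112, 114, 116, 118]
noisy = [100, 120, 95, 130, 90, 125, 105, 140, 85, 135]
print(dispersion_ratio(smooth), dispersion_ratio(noisy))
```

A low ratio for one short series proves nothing by itself: real counts are also shaped by weekly reporting cycles, batching and testing policy, which is why a peer-reviewed, reproducible analysis would be needed before drawing conclusions about any particular country.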

In this particular case, switching to WHO or Johns Hopkins University CSSE (JHU CSSE) data would not be a solution for finding unfabricated data. WHO is restricted to providing official national data, and JHU CSSE data shows suspiciously low-noise daily counts broadly similar to those in the WP C19CCTF data. In fact, the statistical significance of the relation between the Press Freedom Index and low noise is stronger with the JHU CSSE version of the data; see the appendices in the analysis, which aims to be fully reproducible from source data and source code.[1]

What should Wikipedia policy be?

Terminology: reliable vs official

Is it acceptable that we continue to use the term "reliable" (18 Jan 2021) when we really mean "official" (from a government or governmental agency), and we know that "official" in many cases may mean quite likely falsified? Are we contributing to disinformation if we fail to clearly warn readers that "official" information may be fictitious? Should we trust official open government data by default, or should we distrust it by default?

The COVID-19 pandemic is not the only example of government open data used in Wikipedia, and these questions are likely to become more relevant as citizens increasingly pressure governments to publish open data.

Templates

We could create a template with a mouseover, something like {{fv}}, with a superscript message something like "govt" and a longer mouseover message something like: Official information from a governmental institution or agency; "official" information may or may not be reliable.

Official sources noticeboard

Should we have a noticeboard to develop official sources ratings lists something like WP:RSP? This would need enough volunteers willing to rate specific government agencies, or specific governments or countries, and enough information to warn Wikipedians of potential personal and legal security risks involved in them accusing their governments of fabricating data. The debates could risk becoming extremely controversial and subject to the usual risks of controversial Wikipedia topics.

Usage

Elections

The overall and detailed numbers of votes in elections for political office are a form of open government data for which credibility is usually discussed in prose, according to reliable sources independent of the government.

Robots and search engines and websites that feed off machine-readable Wikipedia infoboxes process and propagate the infobox numerical data, but as of 2021, don't propagate the prose information. The prose information is what contains warnings about the information being (in some cases) highly unreliable (except in the sense that the information is a reliable report on the government agency's claim about the data).

COVID-19 pandemic

It can reasonably be argued that the COVID-19 pandemic data currently (Sep 2021) in Wikipedia is reliable in the sense that it represents the governments' points of view on their pandemic statistics. However, would the use of better terminology or some good templates be enough to warn users that the data may be nonsense in some cases, so that we are not contributing to official governmental disinformation?

It would be aesthetically upsetting if we had to exclude COVID-19 pandemic data from those countries whose data is most suspicious, and would risk accusations of pro-Western bias, even if the decisions were based on purely statistical properties of the official government data.[4][1][5][2]

Bayesian option

A possible approach could be to associate a Bayesian probability with the credibility of each source of open government data, where the individual probabilities are generated from peer-reviewed research,[5][1][4][2] preprint research (itself with a lower Bayesian probability of being correct), and media articles (with Bayesian probabilities related to WP:RSP?). Would there be enough people from diverse backgrounds and with the editing capabilities and the enthusiasm to get these data into Wikidata? Currently (Sep 2021), Wikidata elements are subject to much less editorial debate than Wikipedia articles.

Infoboxes for elections, pandemic data or other open government data could have a parameter |credibility_percent = 3 | credibility_refs = <ref name="JStats_Bloggs2017" /> that displays a probability either as a percentage (3% in this case) or as a decimal in the range from 0 to 1, and gives a median (more robust than the mean) credibility estimate based on one or more references. As in ordinary Wikipedia editing, the parameter would quite likely be subject to intense debate on source reliability, how to express the overall value, and so on, depending on the quality of sources for individual open government data articles.
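As a rough sketch of how such a value might be computed, the following takes one credibility estimate per reference, each a probability from 0 to 1, and returns their median as a percentage. The function name `credibility_percent` simply mirrors the hypothetical infobox parameter above, and the input numbers are invented for illustration.

```python
from statistics import median


def credibility_percent(estimates):
    """Median of per-reference credibility estimates, as a percentage.

    Each estimate is a probability in the range 0 to 1, taken from one
    reference. The median is used so that a single outlying reference
    does not dominate the displayed value.
    """
    if not estimates:
        raise ValueError("at least one estimate is required")
    for p in estimates:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"estimate {p!r} is outside the range 0 to 1")
    return 100.0 * median(estimates)


# Invented estimates from three hypothetical references.
example_estimates = [0.02, 0.03, 0.40]
print(credibility_percent(example_estimates))
```

With these invented inputs the median is 3%, whereas the mean would be 15%, pulled upwards by the single outlying estimate of 0.40; this is the sense in which the median is the more robust choice.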

Openness and verifiability of the credibility research itself

En.Wikipedia generally considers any peer-reviewed research published in a reputable research journal to be reliable, without requiring that the research paper be open access, and without requiring that the specific data sources, input parameters and method be presented in a fully reproducible format. Given the risk of initially relying on a small number of research papers in what is as of 2022 a small research field, we could require much higher standards than are typically considered enough. We could require that both:

  1. the research papers would necessarily have to be open access
  2. the research papers would have to be fully reproducible in the "narrower scope": Any results should be documented by making all data and code available in such a way that the computations can be executed again, yielding identical results, by any independent researcher with basic scientific computing skills

How do we combine different researchers' assessments?

If we use the credibility estimates from a single research paper by a single research group (or researcher), then we introduce a high element of sensitivity to error in that one research paper: if the paper is wrong, then that feeds through to a whole range of articles.

If we use the credibility estimates from multiple research papers, then how do we combine them? One solution would be to assign credibility parameters to each of the research papers and/or researchers, and take weighted medians (although this could be seen as a form of WP:SYNTH). There would have to be strong consensus on the method and algorithm. Or we could include ranges: the interquartile range, or the central 95% range if there is a high number of research papers.
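The weighted median and the interquartile range mentioned above can be sketched as follows. This is only one possible convention (the lower weighted median, and the "inclusive" quartile method of Python's standard library); the function names and the example weights are hypothetical, and any real algorithm would need the consensus discussed above.

```python
from statistics import quantiles


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    half = sum(weights) / 2.0
    cumulative = 0.0
    # Walk through the estimates in increasing order, accumulating weight.
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


def interquartile_range(values):
    """Return (first quartile, third quartile); needs at least two values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


# Invented credibility estimates from hypothetical papers, with invented
# weights expressing how much each paper is trusted.
paper_estimates = [0.03, 0.10, 0.60]
paper_weights = [0.5, 0.9, 0.3]
print(weighted_median(paper_estimates, paper_weights))
print(interquartile_range([0.02, 0.03, 0.10, 0.60]))
```

Choosing the weights is where the editorial difficulty lies: the arithmetic is simple, but every weight is itself a judgment about a paper or a researcher.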

Policies

Should there be any specific Wikipedia guideline or policy distinguishing "reliable" versus "official" data? Some sort of text label to clarify the distinction?

Reliable sourcing versus geographical bias dilemma

COVID-19 data is generally more dubious in countries with worse press freedom, so if we remove the dubious data, then we risk worsening the known geographic biases of the English-language Wikipedia. If we don't remove it, then we risk presenting unreliable data as being reliable while appearing to provide less biased encyclopedic coverage. This dilemma is similar to the usual sourcing dilemma in relation to these biases, with the difference that numbers can give a false illusion of reliability, since they tend to appear more objective than words. (Numbers obtained and presented accurately are, of course, at the heart of most of modern science; but there is a huge caveat in the word "accurately".)

Negotiation with other editors on where to compromise, on a case-by-case or topic-by-topic basis on talk pages, with standards evolving with time, is the one way to handle this dilemma.

References