Wikipedia:Wikipedia Signpost/2023-08-01/In focus
Journals cited by Wikipedia
In 2009, I had an idea I thought was pretty neat. What if we looked at all the |journal= parameters of citation templates? Nearly 14 years later, it's more than time to share what that idea became. Here, then, is a historical tour of one of Wikipedia's interesting little secrets.
An idea is born
The idea was born out of a desire to understand which journals were highly cited on Wikipedia, so
After
|journal=
parameters, a most popular journal listing, and a most popular missing journal listing.
Sample outputs of the first decently-accurate run (30 June 2009):
- Most popular journals, covering the top 100 journals
- Most popular missing journals, covering the top 100 missing journals
- A1, the first page of the alphabetical listings
The initial output was crude and inaccurate (especially before the above date), but it was good enough to get us started. WikiStatsBOT was quickly improved to
The early days
The next and last run didn't occur until May 2010. ThaddeusB then abruptly became inactive (a reminder that bus factors of 1 are bad), leaving us with neither bot nor coder. We still worked with what we had, clearing the first 500 journals by the end of December of that year.
A new bot request was made, looking for a bot to take over WikiStatsBOT's old task. I used the opportunity to bring in new ideas, and redesign some of its functionality and visual appearance. After a few days, a coder was found in JLaTondre. A BRFA was filed, and in July 2011, the JL-Bot unleashed its 0s and 1s in service of WikiProject Academic Journals.
Sample output of the first decently-accurate run (10 July 2011):
Again, the bot was quickly improved to clean up some entries, have better accuracy, and present things in a more appealing and useful way.
Modern era
Over time, new sub-compilations were designed to browse the data according to different criteria.
- A by-target compilation (July 2017)
  - A compilation of all redirects pointing to the same target page
- The Wikipedia CiteWatch (August 2018)
  - A compilation of questionable and unreliable sources (see previous Signpost coverage)
- A by-publisher compilation (April 2019)
  - A compilation aiming to group all the journals of a publisher together
- Various maintenance compilations (August 2019)
  - Used to clean up unusual, weird, or known-to-be-wrong stuff
- A by-DOI prefix compilation (December 2019)
  - JSTOR
- A list of CrossRef (January 2020)
- This is not part of the JCW compilation proper, but it is used to create redirects from DOI prefixes used by the compilation.
Those can all be easily accessed through the current
As of writing, the compilation covers about 3.3 million citations, 1.5 million distinct DOIs, and 7,290 distinct DOI prefixes. This is nearly ten times the coverage we had in 2009, which reflects how much Wikipedia has expanded since (both in the number of articles and in the number of citations per article). For posterity,
Summary of the current compilation, based on the 20 July 2023 dump
Most cited publishers | Citations[n 1] | Most cited journals | Citations[n 2] | Most cited missing journals | Citations[n 3]
---|---|---|---|---|---
Elsevier | 360,000 | Nature | 51,000 | The NamesforLife Abstracts | 1,974
Springer Science+Business Media | 286,000 | Proceedings of the National Academy of Sciences of the United States of America | 40,000 | Cesa News | 824
Wiley | 255,000 | Science | 37,000 | New Zealand Journal of Geology and Geophysics | 534
Nature Research | 118,000 | Journal of Biological Chemistry | 33,000 | The Real Estate Record: Real Estate Record and Builders' Guide | 509
Informa | 112,000 | The Astrophysical Journal | 23,000 | Memoirs of the American Entomological Institute | 505
You might say, "but wait, those redlinks contain things that aren't journals!" Well, read on to find out more. I will, however, take a small pause here to thank the various people who helped with the development of the compilation in one way or another over the years.
First
How does it work, exactly?
It is important to understand exactly what the compilation is. As mentioned above, it's a searchable compilation of all |journal= parameters from citation templates on the English Wikipedia, taken from the latest
10.xxxx/...
part of DOIs). It is based on citations like:
<ref name=Bloom1969>{{cite journal |last1=Bloom |first1=E. D. |display-authors=etal |year=1969 |title=High-Energy Inelastic e–p Scattering at 6° and 10° |journal=Physical Review Letters |volume=23 |issue=16 |pages=930–934 |doi=10.1103/PhysRevLett.23.930}}</ref>
It will, however, ignore named-reference repeats like <ref name=Bloom1969/>, as well as "manual" citations like
<ref>Bloom, E. D. et al. "High-Energy Inelastic e–p Scattering at 6° and 10°". Physical Review Letters, 23 (16): 930–934. doi:10.1103/PhysRevLett.23.930</ref>
There is also limited support for semi-manual citations involving {{doi}} and {{doi-inline}}, like:
<ref>Bloom, E. D. et al. "High-Energy Inelastic e–p Scattering at 6° and 10°". Physical Review Letters, 23 (16): 930–934. {{doi|10.1103/PhysRevLett.23.930}}</ref>
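To make the distinction concrete, here is a minimal Python sketch of how the three kinds of citations above could be told apart. It is not JL-Bot's actual code (which this article doesn't show); the function names and regular expressions are assumptions made for illustration, and the template pattern deliberately ignores nested templates.

```python
import re

# Assumption: these patterns are illustrative only and are not taken from JL-Bot.
# Matches {{cite journal |...}}, {{citation |...}}, etc. (nested templates unsupported).
CITE_RE = re.compile(r"\{\{\s*(?:cite\s+\w+|citation)\s*\|(.*?)\}\}",
                     re.IGNORECASE | re.DOTALL)
# Split on "|" unless it sits inside a [[piped|link]].
PARAM_SPLIT_RE = re.compile(r"\|(?![^\[]*\]\])")
# Semi-manual citations: {{doi|10.xxxx/...}} or {{doi-inline|10.xxxx/...|text}}.
DOI_TEMPLATE_RE = re.compile(r"\{\{\s*doi(?:-inline)?\s*\|\s*([^|}]+)", re.IGNORECASE)


def extract_citations(wikitext):
    """Return journal/doi pairs for every citation template with a |journal= value."""
    results = []
    for match in CITE_RE.finditer(wikitext):
        params = {}
        for part in PARAM_SPLIT_RE.split(match.group(1)):
            if "=" in part:
                name, value = part.split("=", 1)
                params[name.strip()] = value.strip()
        if params.get("journal"):
            results.append({"journal": params["journal"], "doi": params.get("doi")})
    return results


def extract_doi_templates(wikitext):
    """Return the DOIs found in {{doi}} and {{doi-inline}} templates."""
    return [m.group(1).strip() for m in DOI_TEMPLATE_RE.finditer(wikitext)]


def doi_prefix(doi):
    """Return the 10.xxxx part of a DOI."""
    return doi.split("/", 1)[0]
```

Run on the examples above, `extract_citations` picks up only the {{cite journal}} citation. The named-reference repeat and the manual citation contain no template, so they yield nothing, while `extract_doi_templates` still recovers the DOI from the semi-manual one.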
Then some cleanup and processing is done:
- |journal=[[Foo|Bar]] is treated as |journal=Bar
- Whitespace and certain templates like {{small}} are stripped and normalized
- Fuzzy logic is used to match likely typos and likely related entries
- For the purpose of matching, common terms are normalized (Bulletin = Bull., Catalogue = Catalog, Journal = J., Proceedings = Proc., etc.) unless an article/redirect exists
- For the purpose of matching, supplements and sections are treated as their base publications (Acta Foobarol. Suppl. = Acta Foobarol., MNRAS Letters = MNRAS, J. Phys. A = J. Phys.) unless an article/redirect exists
- Matching ignores common articles like an, the, and, &; likewise for other languages (French le, la, l', German für, etc.)
- WP:JCW/EXCLUDE is used to unmatch entries that don't belong together. For example, African Journal of Arts will be a fuzzy-logic match for American Journal of Arts, even though nobody with a working brain would think these were the same.
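The cleanup steps listed above can be sketched in Python along these lines. This is only a sketch under stated assumptions: the article does not say which fuzzy-matching algorithm the bot uses, so the standard library's `difflib.SequenceMatcher` stands in for it, and the threshold, the word lists, and the shape of the exclusion set are all hypothetical. Supplements and sections are not handled here.

```python
import re
from difflib import SequenceMatcher

# Assumption: small illustrative word lists based on the examples in the text.
ABBREVIATIONS = {"bulletin": "bull", "catalogue": "catalog",
                 "journal": "j", "proceedings": "proc"}
STOP_WORDS = {"an", "the", "and", "&", "le", "la", "für"}


def clean_value(value):
    """[[Foo|Bar]] -> Bar, [[Foo]] -> Foo, {{small|x}} -> x, whitespace collapsed."""
    value = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", value)
    value = re.sub(r"\{\{\s*small\s*\|([^}]*)\}\}", r"\1", value, flags=re.IGNORECASE)
    return " ".join(value.split())


def match_key(title):
    """Normalize a title for matching purposes only (never for display)."""
    tokens = []
    for raw in clean_value(title).lower().split():
        token = raw.strip(".,:;")
        if token.startswith("l'"):  # French elided article
            token = token[2:]
        if not token or token in STOP_WORDS:
            continue
        tokens.append(ABBREVIATIONS.get(token, token))
    return " ".join(tokens)


def is_fuzzy_match(a, b, excluded=frozenset(), threshold=0.85):
    """True if two titles look alike, unless the pair is explicitly excluded.

    `excluded` is a hypothetical stand-in for WP:JCW/EXCLUDE: a set of
    frozensets, each holding two titles that must never be matched.
    """
    if frozenset((a, b)) in excluded:
        return False
    ratio = SequenceMatcher(None, match_key(a), match_key(b)).ratio()
    return ratio >= threshold
```

With these settings, Bulletin of the Foo Society and Bull. of Foo Society share the same key, a typo like Physical Reveiw Letters still matches Physical Review Letters, and African Journal of Arts matches American Journal of Arts until that pair is added to the exclusion set.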
Matching is not perfect, so you'll often find mismatched entries like:
2842 | Nature Sustainability |
|
When these are found, they can be bypassed in
The |journal= parameter will often be misused for books, magazines, newsletters, or websites, or contain wrong/extraneous data like authors/publisher/volume/page. We try to identify what type of publication we're dealing with in the
Additional information on how to read the compilation can be found at the bottom of each page in the compilation, as well as on the compilation's
How is it used?
The main historical use of the compilation was to find highly cited missing journals. That is still the case today. But so much more can now be done, particularly on cleanup:
- Finding common typos, misspellings, miscapitalizations, using WP:JCW/MISCAPS. (See previous Signpost coverage.)
- Finding unusual typos, misspellings or miscapitalizations. For example, as of writing, the 10.4401 DOI prefix entry lists Annals of Geophysics (97 in 91) and Annals of Geophysics Journal (1 in 1). One might suspect (correctly) that Annals of Geophysics Journal is the wrong name of the journal being cited.
- Finding books being cited as journals, with ISBNs in the journal parameter
- Finding journals with the wrong DOI, or DOIs with the wrong journal.
- Finding former names of journals.
- Finding ISO 4, Bluebook, MathSciNet, or US National Library of Medicine abbreviations and other (often incorrect) abbreviations of journals.
- Creating redirects from Foobar Journal to The Foobar Journal and vice versa
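As an illustration of the DOI-prefix cleanup described in the list above, the following Python sketch groups citations by prefix and flags names that account for only a tiny share of a prefix's citations. The function names and the 5% cutoff are assumptions made for this example, not something the compilation itself defines.

```python
from collections import Counter, defaultdict


def names_by_prefix(citations):
    """Group (journal, doi) pairs by DOI prefix and count each journal name."""
    groups = defaultdict(Counter)
    for journal, doi in citations:
        prefix = doi.split("/", 1)[0]  # the 10.xxxx part
        groups[prefix][journal] += 1
    return groups


def suspicious_names(counter, max_share=0.05):
    """Names making up at most `max_share` of a prefix's citations (hypothetical cutoff)."""
    total = sum(counter.values())
    return sorted(name for name, count in counter.items() if count / total <= max_share)
```

Fed 97 citations to Annals of Geophysics and one to Annals of Geophysics Journal under the 10.4401 prefix, `suspicious_names` returns only the latter. For prefixes shared by many journals (large publishers, say), a rare name is often just a rarely cited journal, so the output is a list of things for a human to check rather than a list of errors.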
Citation bot and JCW-CleanerBot will often be seen doing cleanup based on these compilations.
Where to go from here?
Well, the first natural extension would be
But for now, I hope that you'll have fun exploring the compilation, and perhaps decide you want to tackle the many invalid titles, or clean up the many proceedings cited as journals. Feel free to share your experiences with JCW or suggest improvements to the compilations in the comment section!