Wikipedia:Wikipedia Signpost/2012-11-26/Recent research
Movie success predictions, readability, credentials and authority, geographical comparisons
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
Early prediction of movie box-office revenues with Wikipedia data
An open-access preprint[1] has announced the results from a study attempting to predict early box-office revenues from Wikipedia traffic and activity data. The authors – a team of computational social scientists from Budapest University of Technology and Economics, Aalto University and the Central European University – submit that behavioral patterns on Wikipedia can be used for accurate forecasting, matching and in some cases outperforming the use of social media data for predictive modeling. The results, based on a corpus of 312 English Wikipedia articles on movies released in 2010, indicate that the joint editing activity and traffic measures on Wikipedia are strong predictors of box-office revenue for highly successful movies.
The authors contrast their early prediction approach with more popular real-time prediction/monitoring methods, and suggest that movie popularity can be accurately predicted well in advance, up to a month before the release. The study received broad press coverage and was featured in
Readability of the English Wikipedia, Simple Wikipedia, and Britannica compared
A study
The authors prepared a corpus of matching articles for the purpose of comparison between the English and Simple English Wikipedia. The study did not perform a random selection of articles, but selected a sample based on the existence of a corresponding article in Simple Wikipedia. The findings of the first analysis indicate that Simple Wikipedia consistently outperforms the English Wikipedia on all readability metrics. Wikipedia also appears to contain on average more proper nouns than Britannica – which, the authors speculate, may be due to specific editorial policies. The second section of the paper measures readability for 500 articles for each one of eight topic categories selected from DBpedia (biology, chemistry, computing, economics, history, literature, mathematics, and philosophy).
The comparison indicates that articles in the computing category are the most readable by syntactical and familiarity measures. Biology and chemistry, on the other hand, seem to include the most difficult articles. The final section reviews the readability of Britannica articles, in particular comparing the readability of articles in the "introductory" class with that of Simple Wikipedia articles and the readability of "encyclopedia" class articles with that of Wikipedia articles. The findings indicate that Britannica outperforms Wikipedia in readability overall, while introductory articles outperform Simple Wikipedia articles. It should be noted that the comparisons were not performed on matched pairs and that the criteria used to sample articles from Britannica were not specified.
A paper whose preprint was
Wikipedia favors established views and scientifically backed knowledge
An article appearing in Information, Communication & Society
Using the grounded theory approach, the study focuses not on editors, but on their arguments. It finds that due to community-upheld Wikipedia policies such as Wikipedia:Reliable sources, dissenting opinions ("traditionally marginalized types of knowledge") such as various conspiracy theories are still marginalized or excluded outright; according to the author, this "did not lead to a ‘democratization’ of knowledge production, but rather re-enacted established hierarchies". The finding should be read in context: as the author notes, the article was written by amateurs ("lay participants"), who nevertheless chose to reproduce traditional knowledge hierarchies, relegating conspiracy theories and similar claims not backed up by reliable sources to obscurity on Wikipedia. The paper concludes that Wikipedia, like other encyclopedias, is prone to a "scientism bias", i.e. treating scientifically backed knowledge as "better" than knowledge coming from alternative outlets. Despite Wikipedia's "anyone can edit" motto, the paper finds support for the argument that Wikipedia puts more stress on article quality than on democratic participation, or in the words of the article: "Although laypeople apparently play a significant part in the text production, this does not mean that they favor lay knowledge. On the contrary, it is clearly elite knowledge of well-established authorities which is finally included in the article, whereas alternative interpretations are harshly excluded or at least marginalized."
Side-note: The study's use of the Firefox add-on Wired-Maker for content analysis is rather ingenious, and the mention of such a practical methodological tip in the paper deserves applause.
Trust, authority and credentials on Wikipedia: The case of the Essjay controversy
At the Academy of Management conference in Boston, Dariusz Jemielniak presented a paper on Trust, Control, and Formalization in Open-Collaboration Communities: A Qualitative Study of Wikipedia
The working paper is the first in what Jemielniak suggests will be a series of papers based on a long-term
The paper paints a detailed, nuanced, and deeply informed portrait of Wikipedians' responses to the controversy and the ways in which trust and its relationships to authority and credentials are navigated in the project. The author suggests that the creation of rules and legalistic procedures allowed Wikipedians to walk the line between rejecting descriptions of authority per se and minimizing the effects of inaccurate descriptions of authority. They did so by suggesting that editors on Wikipedia should rely much more heavily on users' experience and on the degree to which particular contributions conform to Wikipedia's content guidelines.
A working paper by the same writer, presented at the annual meeting of the Society for Applied Anthropology,[6] gives an overview of Wikipedia's culture by reviewing the role of its norms, guidelines and policies.
![](http://upload.wikimedia.org/wikipedia/commons/thumb/0/0f/Democrats_and_Republicans_in_Wikipedia_discussions_green.png/450px-Democrats_and_Republicans_in_Wikipedia_discussions_green.png)
Briefly
- Being Wikipedian is more important than the political affiliation: In a recent preprintpolitical affiliation when it comes to user pages. In contrast with other social media (e.g., the blogosphere), where cross-party interactions are very much underrepresented, it appears that Wikipedian dialogues between editors from opposing parties are relatively profound and notable. On the day before the US presidential election, the paper's results were highlighted on the Wikimedia blog under the headline "In divisive times, Wikipedia brings political opponents together".
- Eye-tracking study: Readers look at TOC first, then infobox: A conference paper titled "Looking for genre: the use of structural features during search tasks with Wikipedia"[8] described the results of an eye tracking study, where readers looking for information in a Wikipedia article tended to look first at the table of contents, then at the article's infobox. Also, readers frequently "skim and scroll" long articles.
- Edit categories in featured and non-featured articles: This article focuses on some differences between featured and non-featured articles. Unsurprisingly, the main finding is that featured articles are more stable after promotion; the interesting contribution lies more in the detailed methodology and categorization of various types of edits.[9]
- How the TV schedule influences Wikipedia pageviews: In Germany, several recent consumer studies have found evidence for a rise of what has been called "second screen": The parallel use of TV and the Internet. To find a partial answer to the question whether this use is unrelated (e.g. checking emails while the TV is running in the background) or integrates both media, a blogger turned to pageview numbers for the German Wikipedia[10]. From a still unsystematic analysis, he draws two conclusions: "First, the use of Wikipedia is markedly influenced by the TV schedule. On Saturday evenings in particular, but sometimes also during weekdays, the most viewed Wikipedia entries contain many articles related to the currently showing TV program. Secondly, these articles are primarily viewed while the corresponding show is running on TV." The author also announced a Perl script to convert the raw pageview data provided by the Wikimedia Foundation into a MySQL database, demonstrated in a live list of the 50 most viewed articles of the German Wikipedia.
- A truthfulness verification system based on Wikipedia: Yang Liu's master's thesis (paywalled)[11] discusses the development of WT-verifier, a "truthfulness verification system based on Wikipedia" that uses information on Wikipedia, rather than general web searches, to perform fact checking. Liu finds that Wikipedia "has high reliability of page contents, due to strict rules for page editing and a strong self-fixing mechanism" and adapts T-verifier, an existing system based on Yahoo! searches, applying it to information on Wikipedia. Liu develops what he calls a "truthfulness aware snippet generation algorithm" and finds that the new approach "significantly increases the precision and recall compared to the original T-verifier approach."
- Characterizing Wikipedia traffic: A paper presented at the 7th International Conference on Internet and Web Applications and Services[12] gives a breakdown of 2009 Wikipedia traffic to the 10 largest wikis, with a particular focus on content type. The paper gives a high-level overview of Wikipedia traffic, but the authors do not take the opportunity to dive deeper into the data. The current analysis from the paper can also be found on the Wikimedia Report Card and Wikimedia Statistics page. Suggestions for future research include an in-depth analysis of the temporal dynamics of editing behavior (for example, do we see higher editor activity during holidays?) and an in-depth analysis of multimedia files and the Wikimedia Commons project (are there differences between wiki projects regarding the use of Commons image files?).
- One-year article ratings dump released: The Wikimedia Foundation announced the release of the complete, anonymous data dump of 11M article ratings collected over 1 year (July 2011 – July 2012) from the English Wikipedia under the CC0 license.
- Measuring countries' visibility on Wikipedia: On his "Zero Geography" blog, researcher Mark Graham began a series of posts[14][15] comparing the "geography of views" for different countries on Wikipedia: "we constructed a list of every single article about a place (towns, monuments, historical events, rivers, buildings etc.) in the top 42 Wikipedia language versions, and then queried the number of views that each of those articles received over a two-year period (2009–2011)." (This is part of ongoing research into geographical aspects of the information on Wikipedia by Graham's team at the Oxford Internet Institute, and will be featured in an upcoming paper.) Content about US locations received the most views across languages, followed by the UK and Germany. Graham observed that the top 10 list by pageviews shows a lot of similarity to the top 10 lists of countries by number of articles, and by number of edits originating from that country, but noted that "the UK [being] Europe's most visible country ... is quite interesting because it isn't the country in Europe that uses Wikipedia the most (Germany does)", conjecturing that this might have to do with language differences.
- Ratio of African Wikipedia readers rising, but still low: Erik Zachte, data analyst at the Wikimedia Foundation, blogged an update about "Wikipedia page reads, breakdown by region"[16], observing among other things that "Africa still has a long way to go to gain equal access to internet: with about 15% of the worlds population, 1.4 % of Wikipedia page views is low, but still one and a half as much as 3 years ago."
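The pageview conversion mentioned in the TV schedule item above (raw pageview data loaded into a database, then a list of the most viewed articles) can be sketched roughly as follows. This is only an illustration, not the blogger's script: it uses Python with the standard-library sqlite3 module instead of Perl and MySQL, it assumes each raw line has the layout "project title views bytes", and the sample lines and the function names (parse_line, load_pageviews, top_articles) are hypothetical.

```python
import sqlite3

# Assumed line layout of an hourly raw pageview file (an assumption, not
# taken from the blog post): <project> <page title> <views> <bytes>
SAMPLE_LINES = [
    "de Wetten,_dass..? 5120 104857600",
    "de Hauptseite 9000 52428800",
    "de Hauptseite 1000 6291456",        # same page, a later hour
    "en Main_Page 20000 99999999",       # other project, filtered out
    "de Tatort_(Fernsehreihe) 3100 20971520",
    "this line is malformed",            # skipped by the parser
]


def parse_line(line):
    """Return (project, title, views), or None if the line does not fit the assumed layout."""
    parts = line.split(" ")
    if len(parts) != 4 or not parts[2].isdigit():
        return None
    return parts[0], parts[1], int(parts[2])


def load_pageviews(conn, lines, project):
    """Store the view counts of one project (e.g. "de") in a pageviews table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pageviews (title TEXT NOT NULL, views INTEGER NOT NULL)"
    )
    rows = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and parsed[0] == project:
            rows.append((parsed[1], parsed[2]))
    conn.executemany("INSERT INTO pageviews (title, views) VALUES (?, ?)", rows)
    conn.commit()


def top_articles(conn, limit=50):
    """Return the most viewed titles with their summed view counts."""
    cursor = conn.execute(
        "SELECT title, SUM(views) AS total FROM pageviews "
        "GROUP BY title ORDER BY total DESC, title LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_pageviews(connection, SAMPLE_LINES, "de")
    for title, total in top_articles(connection, limit=50):
        print(title, total)
```

Summing per title across hourly files is what would make it possible to compare an article's views with the broadcast time of the corresponding show; a real setup would also need a timestamp column, which this sketch leaves out.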
Notes
- ^ Mestyán, M., Yasseri, T., & Kertész, J. (2012). Early Prediction of Movie Box Office Success based on Wikipedia Activity Big Data. arXiv. PDF
- ^ a b Jatowt, A., & Tanaka, K. (2012). Is Wikipedia Too Difficult? Comparative Analysis of Readability of Wikipedia, Simple Wikipedia and Britannica. CIKM’12, pp. 2607–2610. PDF • DOI
- ^ Yasseri, T., Kornai, A., & Kertész, J. (2012). A Practical Approach to Language Complexity: A Wikipedia Case Study. PLoS ONE, 7(11), e48386. DOI
- ^ König, R. (2012). Wikipedia. Between lay participation and elite knowledge representation. Information, Communication & Society. Advance online publication. DOI
- ^ Jemielniak, D. (2012). Trust, Control, and Formalization in Open-Collaboration Communities: A Qualitative Study of Wikipedia. Academy of Management 2012 Annual Meeting. PDF
- ^ Jemielniak, D. (2012). Wikipedia: An effective anarchy. Society for Applied Anthropology 2012 Annual Meeting (SfAA 2012). PDF
- ^ a b Neff, J. G., Laniado, D., Kappler, K., Volkovich, Y., Aragón, P., & Kaltenbrunner, A. (2012). Jointly they edit: examining the impact of community identification on political interaction in Wikipedia. arXiv. PDF
- ^ Clark, Malcolm; Ruthven, Ian; O’Brian Holt, Patrik and Song, Dawei (2012). Looking for genre: the use of structural features during search tasks with Wikipedia. Fourth Information Interaction in Context Conference (IIiX 2012). DOI • PDF
- ^ Daxenberger, J., & Gurevych, I. (2012). A Corpus-Based Study of Edit Categories in Featured and Non-Featured Wikipedia Articles. Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012). PDF
- ^ Rycak, M. (17 November, 2012) Wikipedia-Zugriffszahlen bestätigen Second-Screen-Trend. martinrycak.de. HTML
- ^ Liu, Y. (2012). WT-verifier. Truthfulness verification of fact statements on Wikipedia (unpublished master's thesis). State University of New York at Binghamton. HTML
- ^ Reinoso, A. J., Muñoz-Mansilla, R., Herraiz, I., & Ortega, F. (2012). Characterization of the Wikipedia Traffic. Seventh International Conference on Internet and Web Applications and Services (ICIW 2012), pp. 156–162. PDF
- ^ Taraborelli, D. (2012) Wikipedia article ratings. The Data Hub TSV
- ^ Graham, M. (5 November 2012). Virtuous Visible Circles: mapping views to place-based Wikipedia articles. Zero Geography. HTML
- ^ Graham, M. (11 November 2012). The most visible country in Europe (on Wikipedia) is... Zero Geography. HTML
- ^ Zachte, E. (15 November 2012) Wikipedia page reads, breakdown by region. Infodisiac. HTML
Discuss this story
R.König
The paper summary seems to convey the impression that R.König is a "far out there" ultra-relativist / strong programmist. Hope that's what was intended... AnonMoos (talk) 17:24, 28 November 2012 (UTC)
Cities traffic
"the UK [being] Europe's most visible country ... is quite interesting because it isn't the country in Europe that uses Wikipedia the most (Germany does)" - Perhaps it's because the Premier League is Europe's leading football league and British artists (especially actors and musicians) are much more famous than Germans. --NaBUru38 (talk) 18:05, 28 November 2012 (UTC)
Thanks
I always enjoy reading these interesting Recent Research Reports. Thank you to those who contribute to the reports! --Pine✉ 18:59, 28 November 2012 (UTC)