Wikipedia:Proposed tools/Cvcheck

Cvcheck

A copyright tool that checks for text copied from the web.

Problem

AFAIK there are currently only two WP tools available to check articles for the presence of text copied from the Web. Both have limitations.

User:CorenSearchBot runs as a background task on newly created articles. A particular article can also be run through it by adding its name to a queue; the bot states that it will process queued articles when it has a free moment. Its major limitation is that, because it's an automated task, it can't search Google or Google Books.

User:The Earwig's tool [1] is manually invoked. It searches Google, but not Google Books. It would not have caught the material that caused the recent flap [2]. (I don't know whether CorenSearchBot would have caught it either.) Its author is a student who has said they won't have time to improve its algorithm. It doesn't create permanent output (I realize that might pose a maintenance problem). I'm not sure, but from looking at the code [3], I think that if it finds one match, it adds that URL to an exclusion list. If true, this means that the person who tries to clean up the article will need to go on manually comparing the rest of the website to the article; it would be much more efficient to see every match.

Requirements

Check article sentences to see whether they were copied verbatim, or close to verbatim, from websites (excluding known WP mirrors and public-domain sources) or from books in Google Books. Create output listing, for each match: the article section title, the matching sentence or a good-sized sentence fragment, and the URL. Optional but useful: a second-pass option with checkboxes that would allow the user to exclude some of the matched websites, because even if the usual WP mirrors are automatically excluded, one often sees random sites that have scraped WP.

While it's under development, or maybe after it goes live too, dump its search strings out somewhere; we could then contemplate why it didn't find a match where we would have expected it to and think of ways to further improve the algorithm. Novickas (talk) 15:23, 5 November 2010 (UTC)

I've added my enthusiastic support for this idea at the talk page. Something that searches Google Books would be particularly helpful, if this is technically feasible. The checkbox idea would also be useful, although to keep reports manageable I would suggest one difference here: rather than listing complete results and then having a second pass through with a checkbox wherein specific results are excluded, I would propose a brief results page with a checkbox that allows a second pass presenting a complete comparison. (I'm also dreaming of the day when somebody can create a tool to allow me to directly compare two URLs, including old article revisions and current ones; two different Wikipedia articles; a Wikipedia article and an identified external source.) --Moonriddengirl (talk) 11:31, 6 November 2010 (UTC)

Interface design

Console-based, like Earwig's tool.

List of interested developers

Dcoetzee 01:08, 7 November 2010 (UTC)
Flatscan (talk) I have MediaWiki API and JavaScript experience, but I may be able to help with side tasks. 05:48, 8 November 2010 (UTC)
VernoWhitney 18:17, 8 November 2010 (UTC)

High-level architecture

(to be filled in by developers; what components will the tool have, and how will they interact?)

Implementation details

(to be filled in by developers; how will the tool be implemented? what technologies will be used and what implementation issues do you anticipate?)

Progress

Just this morning I implemented a basic prototype of this that seems to do a pretty good job. It doesn't yet account for a lot of things like detecting close paraphrasing or eliminating common phrases or proper names, but a few people have tried it and given good feedback. See:

Duplication Detector tool on Toolserver
Demonstration: [4]
Comparing to a PDF: [5]

It's based on a simple n-gram search algorithm: the webpages are stripped down to text and split into sequences of words, then an index data structure is built from one of them by collecting, for each pair of adjacent words, all positions at which that word pair occurs. It then goes over the other document's sequence of words and, at each position, matches the current pair against each position where that pair occurs in the indexed document, extending each match as far as possible. Finally, the listing is sorted by number of words in descending order, and any result that is a substring of a result already listed is eliminated. PDFs are simply filtered through the existing pdftotext tool first. Dcoetzee 17:23, 21 March 2011 (UTC)
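For readers who want to experiment, the word-pair approach described above might be sketched roughly as follows. This is not the Duplication Detector's actual code: the function names, the lowercasing and `\w+` tokenization, and the `min_words` parameter are all assumptions made for illustration, and HTML stripping and PDF conversion are left out.

```python
import re
from collections import defaultdict


def words(text):
    """Lowercase the text and split it into a sequence of words (assumed normalization)."""
    return re.findall(r"\w+", text.lower())


def build_pair_index(seq):
    """Map each adjacent word pair to every position where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - 1):
        index[(seq[i], seq[i + 1])].append(i)
    return index


def find_matches(text_a, text_b, min_words=2):
    """Return shared word runs, longest first, without contained duplicates."""
    a, b = words(text_a), words(text_b)
    index = build_pair_index(a)  # index one document by word pair
    found = set()
    for j in range(len(b) - 1):
        # Look up every position in the indexed document where this pair occurs.
        for i in index.get((b[j], b[j + 1]), ()):
            # Extend the matched pair forward as far as both documents agree.
            n = 2
            while i + n < len(a) and j + n < len(b) and a[i + n] == b[j + n]:
                n += 1
            if n >= min_words:
                found.add(" ".join(b[j:j + n]))
    # Sort by word count, descending; drop any phrase contained in one already listed.
    results = []
    for phrase in sorted(found, key=lambda p: (-len(p.split()), p)):
        padded = f" {phrase} "
        if not any(padded in f" {kept} " for kept in results):
            results.append(phrase)
    return results


if __name__ == "__main__":
    article = "The quick brown fox jumps over the lazy dog"
    source = "A quick brown fox sat near the lazy dog today"
    for match in find_matches(article, source):
        print(match)
```

Raising `min_words` would be one simple way to suppress very short matches, although it would not by itself handle common phrases or proper names.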