Wikipedia:Wikipedia Signpost/2014-10-15/Technology report

Source: Wikipedia, the free encyclopedia.
Technology report

<big>Attempting<ref>{{citation needed</ref>}} to parse <code>wikitext</code></big>

This week we sat down with The Earwig to learn about his wikitext parser, mwparserfromhell.

What is mwparserfromhell, and how did it get its name?

mwparserfromhell (which I will abbreviate as mwpfh) is a Python parser for wikicode. In short, it allows bot developers (like those using pywikibot) to systematically analyze and manipulate wikitext, even in cases where it is complex or ambiguous.
For example, let's say we want to see if a page transcludes a particular template, check whether it has a particular parameter, and if not, add it. A classic application would be a bot that dates {{citation needed}} tags. This isn't as simple as it sounds! A naive solution might use regexes, but then we need to check whether the parameter exists between the template's opening and closing brackets, but not get confused if it's inside of a template contained within the template (for example, if you had {{citation needed|reason=This fact is important.{{citation needed|date=October 2014}}}}), whether the template is between <nowiki> tags, and so on...
mwparserfromhell makes this easy by creating a tree representation of the wikicode (loosely described as a parse tree) that can be converted back to wikicode after any modifications are made. It focuses on being as accurate as possible, both in terms of the tree representation being accurate, and the outputted wikicode being as similar to the original as possible.
Its name comes courtesy of Σ, reflecting the somewhat insane nature of the project, and as an excuse for its frightening codebase.

What led you to develop it in the first place?

I’ve been writing bots and tools/scripts for many years – situations like the one above come up a lot. Sure, ad hoc solutions using regexes work sometimes, but I wanted something that would work in more general cases. mwparserfromhell seemed like a project that would be useful to most bot developers, and of which there was no existing equivalent.

What were some of the challenges you faced or things that didn't go according to plan while developing the parser? How did you manage them?

Oh, boy. It turns out that wikicode is a horrible, horrible language, for people and computers alike. It lacks a clear definition of how certain edge cases should be handled, and since mwparserfromhell’s goal is to be accurate, a lot of time was spent just trying to figure out how MediaWiki works. Many language parsers are designed to give up once they see a syntax error, like a missing bracket somewhere, but MediaWiki considers all possible wikitext to be valid, so a lot of mwpfh’s code involves making sense of some very questionable things (like templates nested inside of HTML tag attributes nested inside of external links, or the difference between {{{{{foo}}bar}}} and {{{{{foo}}}bar}}) and handling them as closely as possible to the way MediaWiki does. Sometimes this is hard, but other times it is outright impossible and we have to make guesses. For example, if we imagine that the template {{close ref}} transcludes </ref> and the parser encounters the wikicode <ref>{{cite web|…}}{{close ref}}, it will appear as if the <ref> tag does not end, even though it does. This is a limitation inherent in the nature of parsing wikicode: we have no knowledge of the contents of the template, so we can't figure out every situation. mwparserfromhell compromises as best as it can, by treating the <ref> tag as ordinary text and fully parsing the two templates.

How does mwparserfromhell compare to other re-implementations of the MediaWiki parser, like Parsoid?

Most projects like Parsoid (or MediaWiki’s own PHP parser) are designed to convert wikicode to HTML so that it can be viewed or edited by users. mwparserfromhell converts wikicode into a tree structure for bots, and that structure must contain enough information (such as HTML comments, whitespace, and malformed syntax that other parsers would outright ignore or try to correct) for it to be manipulated and converted back to wikitext with no unintentional modifications. Furthermore, it has less awareness of context than other parsers: because it is designed to deal with wikicode on a fairly abstract level, it doesn't know the contents of a template and can't make any substitutions. As noted above, this causes problems sometimes, but it's necessary for the parser to be useful to bots that are manipulating the templates themselves.

What is the most significant challenge that mwparserfromhell currently faces, and why?

It’s a difficult, exhausting project that would ideally have multiple people working on it. Development has stalled recently as I've been busy with college, and additional eyes would be useful to point out potential issues or help out with open problems.

What's next for mwparserfromhell? Do you have any other cool projects you'd like to tell us about?

Some wikitext constructs (primarily tables, but also parser functions and #REDIRECTs) aren’t understood by mwparserfromhell, so I would like to implement those. There’s actually an open request to review some code for table support that I've been procrastinating on for a couple months now. Other than that, I have some plants to make it more efficient; mwpfh has some speed issues with ambiguous syntax on large pages.
My copyvio detection tool on Wikimedia Labs (which uses mwparserfromhell, by the way!) has seen a lot of improvements lately, including more accurate detection, more detailed search results, and a fresh new API. If you don't know about it or have only used it in the past, I invite you to give it a spin.