Wikipedia:AutoWikiBrowser/Database Scanner

Source: Wikipedia, the free encyclopedia.
  • Start — Searches the selected database dump using the settings chosen in the other option boxes
  • Pause
  • Reset

Parameters

Database

  • Database file — use the Browse button to select the database dump (an XML file) that you have downloaded to your machine.
    • The following are automatically read from the header of the XML file specified.
      • Site name — Example: "Wikipedia".
      • Base — Homepage of the site. Example: "https://en.wikipedia.org/wiki/Main_Page".
      • Generator — Software version that created the dump file. Example: "MediaWiki 1.42.0-wmf.26 (8f44039)".
      • Case — Casing configuration of the site. Example: "first-letter".
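
As an illustration of how these header values can be read, here is a minimal Python sketch (not AWB's own code). It assumes the dump follows the usual MediaWiki export layout, with a `<siteinfo>` element holding `<sitename>`, `<base>`, `<generator>` and `<case>`; the function name `read_siteinfo` and the sample data are hypothetical.

```python
import io
import xml.etree.ElementTree as ET

def read_siteinfo(xml_source):
    """Return sitename, base, generator and case from a dump's <siteinfo> header."""
    wanted = {"sitename", "base", "generator", "case"}
    info = {}
    # iterparse streams the file, so parsing stops once the header has been read.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix, if any
        if tag in wanted:
            info[tag] = (elem.text or "").strip()
        elif tag == "siteinfo":
            break  # header finished; the pages that follow are not needed here
    return info

# Hypothetical miniature dump, for demonstration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>https://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.42.0-wmf.26 (8f44039)</generator>
    <case>first-letter</case>
  </siteinfo>
  <page><title>Dog</title></page>
</mediawiki>"""

header = read_siteinfo(io.BytesIO(SAMPLE))
print(header)
```

`read_siteinfo` also accepts a file path, since `iterparse` takes either a filename or a file object.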

Namespaces

Select the namespaces you want to search within. If none are selected, the search will include all available namespaces. Please note that your dump file might not contain data for every namespace available on your wiki.
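
The selection rule above can be sketched in Python as follows. This is only an illustration under assumptions: pages are represented as hypothetical `(namespace_number, title)` pairs, and `filter_by_namespace` is not an AWB function.

```python
def filter_by_namespace(pages, selected_namespaces=()):
    """Yield titles of pages whose namespace is selected.

    pages: iterable of (namespace_number, title) pairs (hypothetical shape).
    An empty selection means "search every namespace".
    """
    selected = set(selected_namespaces)
    for namespace, title in pages:
        if not selected or namespace in selected:
            yield title

# 0 = main, 1 = Talk, 10 = Template, 14 = Category
PAGES = [(0, "Dog"), (1, "Talk:Dog"), (10, "Template:Infobox"), (14, "Category:Dogs")]

print(list(filter_by_namespace(PAGES)))           # nothing selected: all pages
print(list(filter_by_namespace(PAGES, {0, 14})))  # main and Category only
print(list(filter_by_namespace(PAGES, {2})))      # namespace absent from this dump
```

The last call returns no titles, which mirrors the note that a dump file may not contain data for every namespace.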

Title matching

  • Title does contain — Restrict the search to titles containing the text, or matching the text if the Regex option is used.
  • Title does not contain — Restrict the search to titles NOT containing the text, or NOT matching the text if the Regex option is used.
  • Regex — Treat the text as a regular expression (see AWB Regex help).
  • Case sensitive — Whether the text/matching pattern should be case sensitive.
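
How the four title options combine can be sketched in Python. This is a hedged illustration rather than AWB's implementation: the function `title_matches` and its parameter names are hypothetical, and Python's `re` module stands in for AWB's regex engine.

```python
import re

def title_matches(title, contains="", not_contains="", regex=False, case_sensitive=False):
    """Apply the 'does contain' / 'does not contain' title filters to one title."""
    flags = 0 if case_sensitive else re.IGNORECASE

    def found(needle):
        # Without the Regex option the text is matched literally.
        pattern = needle if regex else re.escape(needle)
        return re.search(pattern, title, flags) is not None

    if contains and not found(contains):
        return False
    if not_contains and found(not_contains):
        return False
    return True

print(title_matches("List of dogs", contains="list of"))
print(title_matches("List of dogs", contains="list of", case_sensitive=True))
print(title_matches("List of dogs", contains=r"^List of \w+$", regex=True))
print(title_matches("List of dogs", not_contains="dogs"))
```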

Revision

Last edited date

  • Search date — Tick to restrict the search to pages whose last revision (last edited) date falls within a range.
    • From — Start date of range.
    • To — End date of range.
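
A date-range check of this kind can be sketched in Python. The sketch assumes that revision timestamps in the dump use the form `2024-06-15T08:30:00Z`, and it treats both ends of the range as inclusive, which is an assumption rather than documented AWB behaviour. The name `in_date_range` is hypothetical.

```python
from datetime import datetime, timezone

def in_date_range(timestamp, start, end):
    """Return True if a revision timestamp lies between start and end (inclusive)."""
    edited = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return start <= edited <= end

START = datetime(2024, 1, 1, tzinfo=timezone.utc)               # "From"
END = datetime(2024, 12, 31, 23, 59, 59, tzinfo=timezone.utc)   # "To"

print(in_date_range("2024-06-15T08:30:00Z", START, END))
print(in_date_range("2023-12-31T23:59:59Z", START, END))
```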

Text

Text searching

  • Contains — The keywords %%title%%, %%key%%, %%titlename%% and %%namespace%% work if the search is not regex
  • Not contains — The keywords %%title%%, %%key%%, %%titlename%% and %%namespace%% work if the search is not regex
  • Regex — Treat the text as a regular expression (see AWB Regex help)
  • Singleline — Changes meaning of "." so it matches all characters, as opposed to all apart from newlines
  • Case sensitive — Enables case sensitivity
  • Multiline — Changes meaning of "^" and "$" so they represent the beginning and end respectively of every line, rather than just of the entire string
  • Ignore <!-- comments -->
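
The effect of the Singleline, Multiline, Case sensitive and Ignore comments options can be demonstrated with Python's `re` module, used here as a stand-in (AWB's own regex engine may differ in details). The helper names `build_flags` and `text_matches` are hypothetical.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)

def build_flags(singleline=False, multiline=False, case_sensitive=False):
    """Translate the scanner's checkboxes into Python regex flags."""
    flags = 0
    if singleline:
        flags |= re.DOTALL      # "." also matches newlines
    if multiline:
        flags |= re.MULTILINE   # "^" and "$" match at every line
    if not case_sensitive:
        flags |= re.IGNORECASE
    return flags

def text_matches(text, pattern, ignore_comments=False, **options):
    """Search page text, optionally with <!-- comments --> stripped first."""
    if ignore_comments:
        text = COMMENT.sub("", text)
    return re.search(pattern, text, build_flags(**options)) is not None

page = "Intro line\n== History ==\n<!-- hidden note -->\nEnd"

print(text_matches(page, r"Intro.*History"))                   # "." stops at the newline
print(text_matches(page, r"Intro.*History", singleline=True))
print(text_matches(page, r"^== History ==$"))                  # anchors match whole string only
print(text_matches(page, r"^== History ==$", multiline=True))
print(text_matches(page, r"hidden", ignore_comments=True))
```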

Page text properties

  • Characters
  • Links
  • Words

Searching

AWB specific

  • None — will just list all the pages in the database dump (that match other scan filter criteria)
  • Has title AWB will embolden
  • Has links AWB will simplify — allows you to search a database dump for links that can be simplified, e.g.:
    • [[Dog|Dog]] is simplified to [[Dog]]
    • [[Dog|Dogs]] is simplified to [[Dog]]s
  • Has bad links AWB will fix
  • Has HTML entities
  • Section error
  • Unbulleted links — will search a database dump for any pages that have external links which are not bullet pointed
  • Typo — allows you to search a database dump for spelling mistakes, in the same way that AWB can when RegexTypoFix is enabled
  • Missing {{defaultsort}}
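
The two link simplifications listed under "Has links AWB will simplify" can be sketched with a regular expression. This Python sketch covers only those two documented cases; AWB's real rules are likely broader, and the function names here are hypothetical.

```python
import re

# A piped link: [[target|label]], with no nested brackets or extra pipes.
PIPED_LINK = re.compile(r"\[\[([^\[\]|]+)\|([^\[\]|]+)\]\]")

def simplify_link(match):
    target, label = match.group(1), match.group(2)
    if label == target:
        return f"[[{target}]]"                      # [[Dog|Dog]]  -> [[Dog]]
    suffix = label[len(target):]
    if (label.startswith(target) and suffix.isascii()
            and suffix.isalpha() and suffix.islower()):
        return f"[[{target}]]{suffix}"              # [[Dog|Dogs]] -> [[Dog]]s
    return match.group(0)                           # leave other links untouched

def simplify_links(wikitext):
    return PIPED_LINK.sub(simplify_link, wikitext)

def has_simplifiable_links(wikitext):
    """True if the page would be listed by this (sketched) scan option."""
    return simplify_links(wikitext) != wikitext

text = "A [[Dog|Dog]] and two [[Dog|Dogs]] met a [[Dog|puppy]]."
print(simplify_links(text))
```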

Other options

  • Start from page — Starts from an entered page name. The dump is scanned until the specified page is found, then the scan continues as normal using the other search settings. Scanning until a page is found is faster than scanning with the full settings, but the dump file up to that page still has to be read, so this will take time (approximately 30 seconds per gigabyte of XML data, depending on your system's CPU speed).
  • Limit results to — Limits the number of results from the database dump that will be displayed. If the limit is reached, the scan will stop early.
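
The interaction of these two options can be sketched in Python. This is an illustration under assumptions: pages are hypothetical `(title, text)` pairs, `matches` stands for whatever the other scan settings decide, and including the start page itself in the scan is an assumption.

```python
from itertools import islice

def scan(pages, matches, start_from=None, limit=None):
    """Return matching titles, optionally skipping ahead and stopping early."""
    def candidates():
        skipping = start_from is not None
        for title, text in pages:
            if skipping:
                if title != start_from:
                    continue      # cheap skip: the search filters are not evaluated
                skipping = False  # start page found; scan normally from here
            if matches(title, text):
                yield title
    # islice stops pulling pages once `limit` results have been produced.
    return list(islice(candidates(), limit))

PAGES = [("A", "cat"), ("B", "dog"), ("C", "dog"), ("D", "dog")]
has_dog = lambda title, text: "dog" in text

print(scan(PAGES, has_dog))
print(scan(PAGES, has_dog, start_from="C"))
print(scan(PAGES, has_dog, limit=2))
```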

Restriction

Allows searching for pages with edit restrictions (semi-protected, fully protected, etc.).

Help

Links to relevant dump help pages.

Output

Performance

The speed of the database scanner mainly depends on two factors of the system it's run on:

  1. CPU single-threaded performance
  2. Hard disk read speed

Example performance: Intel … MB/s disk sequential read

So, with a reasonable 2010-era or later CPU, AWB will read the database XML dump file at around 30 MB/s and be CPU-limited. If the database file is read from networked storage, scan performance will therefore drop whenever the network transfer speed is below this rate. Modern mechanical hard disks can normally provide sequential read speeds well above 30 MB/s, so when the dump is read from a local disk the scan speed remains CPU-limited.
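
As a rough worked example of the figures above, the following Python sketch estimates the scan time from the dump size. It assumes the approximate 30 MB/s CPU-limited rate quoted here and 1 GB = 1000 MB; the function name and the optional `link_rate_mb_s` parameter (for a slower network link) are hypothetical.

```python
def estimated_scan_seconds(dump_size_gb, read_rate_mb_s=30.0, link_rate_mb_s=None):
    """Estimate scan time; the slower of CPU-limited rate and link rate wins."""
    if link_rate_mb_s is None:
        effective = read_rate_mb_s
    else:
        effective = min(read_rate_mb_s, link_rate_mb_s)
    return dump_size_gb * 1000 / effective

print(round(estimated_scan_seconds(1)))                   # about 33 seconds per gigabyte
print(estimated_scan_seconds(3, link_rate_mb_s=10))       # network-limited: 300.0 seconds
print(estimated_scan_seconds(3, link_rate_mb_s=100))      # fast link: still CPU-limited
```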

The database scanner is multi-threaded. It uses the main thread to read the database XML file from disk, and additional thread(s) to search the articles based on the user's search criteria. The total number of threads equals the number of CPU cores (e.g. a quad-core CPU without hyperthreading gives 1 main and 3 secondary threads). The main thread will pause XML reading and contribute to article searching if the secondary threads get too far behind. This happens when searching an article against the search criteria is slower than reading it from the XML file, which is typically the case. It does occur on the example Core i5 520M: database scanner performance is then limited by how fast all the threads can search the articles, so overall performance is limited by the multi-threaded performance of the CPU.
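
The coordination described above (one reading thread, several searching threads, and a reader that helps out when the backlog grows) can be sketched in Python. This is not AWB's code: the names, the `backlog_limit` threshold and the `(title, text)` page shape are all assumptions, and because of Python's global interpreter lock the sketch illustrates the scheduling pattern rather than real CPU parallelism.

```python
import os
import queue
import threading

def scan_parallel(pages, matches, workers=None, backlog_limit=100):
    """pages: iterable of (title, text); matches: predicate on (title, text)."""
    if workers is None:
        workers = max(1, (os.cpu_count() or 2) - 1)  # one core is left for the reader
    pending = queue.Queue()
    results = []
    lock = threading.Lock()

    def search(page):
        title, text = page
        if matches(title, text):
            with lock:
                results.append(title)

    def worker():
        while True:
            page = pending.get()
            if page is None:  # sentinel: no more pages will arrive
                return
            search(page)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()

    # Main thread: read pages, and pause reading to help search
    # whenever the secondary threads have fallen too far behind.
    for page in pages:
        pending.put(page)
        while pending.qsize() > backlog_limit:
            try:
                search(pending.get_nowait())
            except queue.Empty:
                break

    for _ in threads:
        pending.put(None)
    for t in threads:
        t.join()
    return sorted(results)

PAGES = [(f"P{i}", "dog" if i % 2 == 0 else "cat") for i in range(50)]
found = scan_parallel(PAGES, lambda title, text: text == "dog", workers=3, backlog_limit=5)
print(len(found))
```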

A CPU with more cores and/or better per-core performance would improve database scanner performance.

Results

  • Filter — allows you to filter the results found from the database dump. The options are the same as for the normal AWB list filter
  • Save — saves the list as a text document
  • Clear — clears the list of pages

Convert

  • Add headings every — adds a heading every x lines
  • Alphabetised headings
  • # — makes a list with # before each page name; if placed on a wiki page, this will number the lines
  • * — makes a list with * before each page name; if placed on a wiki page, this will bullet point the lines
  • A B C... headings — adds headings == heading == for page names beginning with that letter
  • Make — makes the list
  • Copy — copies the list to the user's clipboard for pasting into another document
  • Save — saves the list as a text document
  • Clear — removes all pages from the page list
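
The list conversion can be sketched in Python as follows. This is a hedged illustration of the "#", "*" and "A B C... headings" options only; the function `make_list` is hypothetical, and sorting the titles before adding alphabetical headings is an assumption.

```python
def make_list(titles, prefix="#", alphabetised_headings=False):
    """Build wikitext list lines, optionally grouped under == A ==, == B == ..."""
    ordered = sorted(titles) if alphabetised_headings else list(titles)
    lines = []
    current = None
    for title in ordered:
        initial = title[:1].upper()
        if alphabetised_headings and initial != current:
            current = initial
            lines.append(f"== {initial} ==")  # new heading for this letter
        lines.append(f"{prefix} {title}")
    return "\n".join(lines)

print(make_list(["Dog", "Cat"]))
print(make_list(["Dog", "Cat", "Cow"], prefix="*", alphabetised_headings=True))
```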