Wikipedia:Bots/Requests for approval/Bot1058 8


Operator: Wbm1058 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 02:36, Saturday, June 25, 2022 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): PHP

Source code available: refreshlinks.php, refreshmainlinks.php

Function overview: Purge pages with recursive link update in order to refresh stale links

Links to relevant discussions (where appropriate): User talk:wbm1058#Continuing null editing, Wikipedia talk:Bot policy#Regarding WP:BOTPERF, phab:T157670, phab:T135964, phab:T159512

Edit period(s): Continuous

Estimated number of pages affected: ALL

Exclusion compliant (Yes/No): No

Already has a bot flag (Yes/No): Yes

Function details: This task runs two scripts to refresh English Wikipedia page links. refreshmainlinks.php null-edits mainspace pages whose page_links_updated database field is older than 32 days, and refreshlinks.php null-edits pages in all other namespaces whose page_links_updated field is older than 80 days. The 32- and 80-day figures may be tweaked as needed, either to refresh links more promptly or to reduce load on the servers. Each script is configured to edit a maximum of 150,000 pages in a single run, and restarts every three hours if not currently running (thus each script may run up to 8 times per day).
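A minimal sketch of the approach (illustrative only – this is not the published refreshlinks.php/refreshmainlinks.php source; the replica-database host, the credentials, and the omitted HTTP layer are assumptions):

    <?php
    // Connect to the English Wikipedia database replica (Toolforge-style host;
    // credentials are placeholders).
    $db = new mysqli('enwiki.analytics.db.svc.wikimedia.cloud', $user, $pass, 'enwiki_p');

    // Mainspace pages whose link tables are stale (page_links_updated holds a
    // YYYYMMDDHHMMSS timestamp), oldest first, capped at 150,000 per run.
    $cutoff = date('YmdHis', strtotime('-32 days'));
    $result = $db->query(
        "SELECT page_title FROM page
         WHERE page_namespace = 0 AND page_links_updated < '$cutoff'
         ORDER BY page_links_updated LIMIT 150000"
    );

    $titles = [];
    while ($row = $result->fetch_assoc()) {
        $titles[] = str_replace('_', ' ', $row['page_title']);
    }

    // Purge each batch of pages with a recursive link update (the batching is
    // discussed further below).
    foreach (array_chunk($titles, 20) as $batch) {
        $params = [
            'action'                   => 'purge',
            'forcerecursivelinkupdate' => 1,
            'titles'                   => implode('|', $batch),
            'format'                   => 'json',
        ];
        // ... POST $params to https://en.wikipedia.org/w/api.php with the
        // bot's authenticated session (omitted here) ...
    }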

Status may be monitored by these Quarry queries:
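As a hypothetical illustration of the kind of check involved (not one of the actual Quarry queries; Quarry runs SQL against the replica databases), a status query might count pages whose page_links_updated value has fallen outside the target window:

    -- Hypothetical monitoring query: how many pages per namespace are
    -- overdue for a link refresh, and how old is the oldest?
    SELECT page_namespace,
           COUNT(*) AS stale_pages,
           MIN(page_links_updated) AS oldest
    FROM page
    WHERE page_links_updated < DATE_FORMAT(NOW() - INTERVAL 70 DAY, '%Y%m%d%H%i%S')
    GROUP BY page_namespace
    ORDER BY stale_pages DESC;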


Discussion

I expect speedy approval as a technical request, since this task only makes null edits. The task has been running for over a month. My main reason for filing this is to post my source code and document the process, including links to the various discussions about it. – wbm1058 (talk) 03:02, 25 June 2022 (UTC)

  • Comment: This is a very useful bot that works around long-standing feature requests that should have been straightforward for the MW developers to implement. It makes sure that things like tracking categories and transclusion counts are up to date, which helps gnomes fix errors. – Jonesey95 (talk) 13:30, 25 June 2022 (UTC)
  • Comment: My main concerns are related to the edit filter; I'm not sure whether that looks at null edits or not. If it does, it's theoretically possible that we might suddenly be spammed by a very large number of filter log entries, if and when a filter gets added that widely matches null edits (and if null edits do get checked by the edit filter, we would want the account making them to have a high edit count and to be autoconfirmed, because for performance reasons, many filters skip users with high edit counts).

    To get some idea of the rate of null edits: the robot's maximum editing speed is 14 edits per second (150,000 × 8 = 1,200,000 pages in a day). There are 6,821,450 articles and 60,629,756 pages in total (how did we end up with almost ten times as many pages as articles?); this means that around 825,000 purges need making per day, or around 9.5 per second. Wikipedia currently gets around 160,000 edits per day (defined as "things that have an oldid number", so including moves, page creations, etc.), or around 2 per second. So this bot could be editing four times as fast as everyone else on Wikipedia put together (including all the other bots) – maybe a bit less, but surely a large proportion of pages rarely get edited – which would likely be breaking new ground from the point of view of server load (although the servers might well be able to handle it anyway, and if not I guess the developers would just block its IP from making requests). --ais523 23:06, 6 July 2022 (UTC)

    As a precaution, the bot should also avoid null-editing pages that contain {{subst: (possibly with added whitespace or comments), because null edits can sometimes change the page content in this case (feel free to null-edit User:ais523/Sandbox to see for yourself – just clicking "edit" and "save" is enough). It's very hard to get the wikitext to subst a template into a page in the first place (because it has a tendency to replace itself with the template's contents), but once you manage it, it can lie there ready to trigger and mess up null edits, and this seems like the sort of thing that might happen by mistake (e.g. Module:Unsubst is playing around in similar space; although that one won't have a bad interaction with the bot, it's quite possible we'll end up creating a similar template in future and that one will cause problems). --ais523 23:06, 6 July 2022 (UTC)

    • While this task does not increase the bot's edit count, it has performed 7 other tasks and has an edit count of over 180,000, which should qualify as "high". wbm1058 (talk) 03:38, 8 July 2022 (UTC)
    • There are far more users than articles; I believe User talk: is the largest namespace and thus the most resource-intensive to purge (albeit perhaps with a smaller average page size). wbm1058 (talk) 03:38, 8 July 2022 (UTC)
    • The term "null edit" is used here for convenience and simplification; technically the bot purges the page cache and forces a recursive link update. This is about equivalent to a null edit, but I'm not sure that it's functionally exactly the same. – wbm1058 (talk) 03:38, 8 July 2022 (UTC)
      • Ah; this seems to be a significant difference. A "purge with recursive link update" on my sandbox page doesn't add a new revision, even though a null edit does. Based on this, I suspect that purging pages is lighter on the server load than an actual null edit would be, and also recommend that you use "purge with recursive link update" rather than "null edit" terminology when describing the bot. --ais523 08:32, 8 July 2022 (UTC)
        • Yes, and just doing a recursive link update would be even lighter on the server load. The only reason my bot forces a purge is that there is currently no option in the API for only updating links. See this Phabricator discussion. – wbm1058 (talk) 12:42, 8 July 2022 (UTC)
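          For background, the purge API distinguishes three levels (these are standard MediaWiki parameters; the summary here is added for illustration):

            action=purge                              – clear the parser cache only
            action=purge&forcelinkupdate=1            – also refresh the page's own link tables
            action=purge&forcerecursivelinkupdate=1   – also refresh pages that transclude it

          None of these performs the link update without the purge, which is the missing option referred to above.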
    • As I started work on this project on March 13, 2022, and the oldest page_links_updated date (except for the Super Six) is April 28, 2022, I believe that every page in the database has now been null-edited at least once within the last 72 days, and I've yet to see any reports of problems with unintended substitution. wbm1058 (talk) 03:38, 8 July 2022 (UTC)
      • This is probably a consequence of the difference between purges and null edits; as long as you stick to purges it should be safe from the point of view of unintended substitution. --ais523 08:32, 8 July 2022 (UTC)
    • To make this process more efficient, the bot bundles requests into groups of 20; each request sent to the server asks for 20 pages to be purged at once (a sketch of one such batched request follows below). wbm1058 (talk) 03:38, 8 July 2022 (UTC)
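      A minimal sketch of one such batched request, using PHP's curl extension (the titles and the session handling are placeholders, not the bot's actual code):

        <?php
        // Twenty page titles joined with '|' so that one POST purges all of them.
        $batch = ['Example title 1', 'Example title 2' /* ... up to 20 titles ... */];
        $ch = curl_init('https://en.wikipedia.org/w/api.php');
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_POSTFIELDS     => http_build_query([
                'action'                   => 'purge',
                'forcerecursivelinkupdate' => 1,
                'titles'                   => implode('|', $batch),
                'format'                   => 'json',
            ]),
            // A real bot would also attach its authenticated session cookies here.
        ]);
        $response = json_decode(curl_exec($ch), true);
        curl_close($ch);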
  • Comment: I've worked the refreshlinks.php cutoff down from 80 to 70 days; the process may be able to hold it there. I've been trying to smooth out the load so that roughly the same number of pages are purged and link-refreshed each day. – wbm1058 (talk) 11:49, 8 July 2022 (UTC)
  • Note. This process is dependent on my computer maintaining a connection with a Toolforge bastion. Occasionally my computer becomes disconnected for unknown reasons, and when I notice this I must manually log back in to the bastion. If my computer becomes disconnected from the bastion for an extended time, this process may fall behind the expected page_links_updated dates. – wbm1058 (talk) 11:55, 12 July 2022 (UTC)
  • Another note. The purpose of this task is to keep the pagelinks, categorylinks, and imagelinks tables reasonably up to date. Regenerating these tables using the rebuildall.php maintenance script is not practical for English Wikipedia due to its huge size. Even just running the RefreshLinks.php component of rebuildall is not practical due to the database size (it may be practical for smaller wikis). The goal of phab:T159512 (add an option to refreshLinks.php to only update pages that haven't been updated since a timestamp) is to make it practical to run RefreshLinks.php on English Wikipedia. My two scripts find the pages that haven't been updated since a timestamp, and then purge those pages with recursive link updates. Recursive link updates are what refreshLinks.php does. – wbm1058 (talk) 14:42, 16 July 2022 (UTC)
  • Approved for trial (30 days). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Let's see if anything breaks. Primefac (talk) 16:24, 6 August 2022 (UTC)