Wikipedia:Link rot/URL change requests/Archives/2021/May

Fix pdfs.semanticscholar.org links

The pdfs.semanticscholar.org URLs which HTTP 301 redirect to www.semanticscholar.org are actually dead links. There are quite a few now. A link to the Wayback Machine is possible, but I believe InternetArchiveBot would not normally add it. Nemo 21:15, 28 April 2021 (UTC)

They are dead links for WP:V purposes, so the bot will add archive URLs. If the citation already has an archive link it will be skipped. If no archive link can be found it will leave the URL in place and let Citation bot handle it - can generate a list of these; there probably will not be many. -- GreenC 21:29, 28 April 2021 (UTC)
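For illustration, a minimal sketch of that decision logic in Python, assuming mwparserfromhell template objects and the public Wayback Machine availability API; this is a sketch, not the bot's actual code, and the function name is made up:

import requests
import mwparserfromhell  # citation templates come from mwparserfromhell.parse()

WAYBACK_API = "https://archive.org/wayback/available"

def handle_citation(template):
    # Skip citations that already carry an archive link.
    if template.has("archive-url"):
        return "skip: already archived"
    if not template.has("url"):
        return "skip: no URL"
    url = str(template.get("url").value).strip()
    if "pdfs.semanticscholar.org" not in url:
        return "skip: different domain"
    closest = (requests.get(WAYBACK_API, params={"url": url}).json()
               .get("archived_snapshots", {}).get("closest"))
    if closest and closest.get("available"):
        ts = closest["timestamp"]  # 14 digits, YYYYMMDDhhmmss
        template.add("archive-url", closest["url"])
        template.add("archive-date", f"{ts[:4]}-{ts[4:6]}-{ts[6:8]}")
        template.add("url-status", "dead")
        return "archived"
    # No archive found: leave the URL in place for Citation bot.
    return "left for Citation bot"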
Makes sense, thank you! Nemo 06:42, 29 April 2021 (UTC)
Nemo, testing is going well and it's about ready for the full run. There are a number of edge-case types that required special handling, so it's a good thing this is custom. Question: with this diff, do you know whether Citation bot would keep the archive URL or remove it? -- GreenC 16:51, 29 April 2021 (UTC)
Those diffs look good. As far as I know, at the moment Citation bot is not removing those URLs; I've tested on a few articles after your bot's edits and they were left alone. Nemo 04:38, 30 April 2021 (UTC)

Nemo, looks done, let me know if you see any problems. -- GreenC 16:43, 30 April 2021 (UTC)

Thank you! Wikipedia:Link rot/cases/pdfs.semanticscholar.org is super useful. I noticed that OAbot can find more URLs to add when a DOI is available and the URL parameter is cleared. So I think I'll do another pass with OAbot by telling it to ignore the SemanticScholar URLs, and then I'll manually remove the redundant ones. Nemo 20:48, 1 May 2021 (UTC)
Actually, I'll track that at phabricator:T281631 for better visibility. Nemo 21:51, 1 May 2021 (UTC)

Results

  • Edited 2,754 articles
  • Added 3,204 new archive URLs for pdfs.semanticscholar.org
  • Added/changed 74 |url-status=dead in preexisting archive URLs
  • 485 URLs with no archives found: Wikipedia:Link rot/cases/pdfs.semanticscholar.org
  • Updated the IABot database: blacklisted the archived URLs above while retaining the whitelist for the remaining URLs in the domain.

TracesOfWar citations update

Wikipedia currently contains citations and source references to the websites TracesOfWar.com and .nl (EN-NL bilingual), but also to the former websites ww2awards.com, go2war2.nl and oorlogsmusea.nl. However, these websites have been integrated into TracesOfWar in recent years, so the source reference is now incorrect on hundreds of pages, and in a multiple of that number of individual source references. Fortunately, ww2awards and go2war2 currently still redirect to the correct page on TracesOfWar, but this is no longer the case for oorlogsmusea.nl. I have been able to correct all the sources for oorlogsmusea.nl manually. For ww2awards and go2war2 the redirects will stop in the short term, which will result in thousands of dead links, even though they could simply be pointed at the same source. A short example: the person Llewellyn Chilson (TracesOfWar persons id 35010) currently has a source reference to http://en.ww2awards.com/person/35010, but this must be https://www.tracesofwar.com/persons/35010/. In short: old URL format to new URL format, same ID.

In my opinion, that should make it possible to convert everything in the format 'http://en.ww2awards.com/person/[id]' (old English) or 'http://nl.ww2awards.com/person/[id]' (old Dutch) to 'https://www.tracesofwar.com/persons/[id]' (new English) or 'https://www.tracesofwar.nl/persons/[id]' (new Dutch) respectively. The same applies to go2war2.nl, but with a slightly different format: http://www.go2war2.nl/artikel/[id] becomes https://www.tracesofwar.nl/articles/[id]. The same has already been done on the Dutch Wikipedia, via a similar bot request. Lennard87 (talk) 18:50, 29 April 2021 (UTC)
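For what it's worth, a rough sketch of the requested rewrites in Python, with the patterns inferred from the examples above (the /award/ variant discussed below is not included):

import re

RULES = [
    # old English and Dutch ww2awards person pages -> tracesofwar persons
    (re.compile(r"https?://en\.ww2awards\.com/person/(\d+)"),
     r"https://www.tracesofwar.com/persons/\1/"),
    (re.compile(r"https?://nl\.ww2awards\.com/person/(\d+)"),
     r"https://www.tracesofwar.nl/persons/\1/"),
    # go2war2 articles -> tracesofwar.nl articles, same ID
    (re.compile(r"https?://www\.go2war2\.nl/artikel/(\d+)"),
     r"https://www.tracesofwar.nl/articles/\1"),
]

def migrate(url):
    for pattern, replacement in RULES:
        if pattern.match(url):
            return pattern.sub(replacement, url)
    return url  # unknown format: leave untouched

assert migrate("http://en.ww2awards.com/person/35010") == \
    "https://www.tracesofwar.com/persons/35010/"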

@Lennard87:, I'm seeing around 500 mainspace URLs on enwiki for all domains combined. Can you verify we're not missing any? -- GreenC 22:18, 1 May 2021 (UTC)

@GreenC:, that is quite possible, yes, but I have no exact numbers. In any case, those roughly 350 (go2war2 + ww2awards) should then be changed to tracesofwar.com or .nl.

@Lennard87: results for ww2awards: it moved 251 URLs. Five examples show different types of problems: [1] [2][3][4][5] .. the variations on "WW2 Awards", and its location in the cite, are difficult. (BTW, instead of /person/ some have /award/, which at the new site is /awards/. Example) -- GreenC 18:43, 2 May 2021 (UTC)

Results for go2war2 are similar: it moved 48 URLs: [6][7] -- GreenC 19:26, 2 May 2021 (UTC)

@GreenC:, thanks. I've seen those situations - they are difficult, but the proposed changes are correct. Also, yes, I forgot about the /award/ change; that can be applied too, please. Only the last one, with Gunther Josten, is difficult, as the picture id has changed as well: https://www.mystiwot.nl/myst/upload/persons/9546061207115933p.jpg. There is no relation between the two, so it's best to leave 'images-person' alone or use the web archive trick.

ancient.eu

Ancient History Encyclopedia has rebranded to World History Encyclopedia and moved domain to worldhistory.org. There are many references to the site across Wikipedia. All references pointing to ancient.eu should instead point to worldhistory.org. Otherwise the URL structure is the same (i.e. https://www.ancient.eu/Rome/ is now https://www.worldhistory.org/Rome/). — Preceding unsigned comment added by Thamis (talk • contribs)

Hi @Thamis:, thanks for the lead/info; this is certainly possible to do. Do you think there is reason to consider content drift, i.e. the page at the new site differing from the original in substance, or is it largely a 1:1 copy of the core content? Comparing this page with this page, it looks like this is an administrative change and not a content change. -- GreenC 23:40, 20 April 2021 (UTC)
Thanks for looking into this, @GreenC:. There's no content drift; it's a 1:1 copy of the content with the exact same URLs (just the domain is different). When I compare the two Rome pages that you linked, from the archive and the new domain, I see the exact same page. The same is true for any other page you might want to check. :-)

@Thamis:, this url works but this url does not. The etc.ancient.eu sub-domain did not transfer, but still works at the old site. The bot will skip these, as the links still work, and I don't want to add an archive URL to live links if they will be transferred to worldhistory.org in the future. Can be revisited later. -- GreenC 16:03, 23 April 2021 (UTC)

@GreenC: Indeed, that etc.ancient.eu subdomain was not transferred. It's the www.ancient.eu domain that turned into www.worldhistory.org -- subdomains other than "www" should be ignored.
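To make the scope concrete, a small sketch of the rewrite rule under the constraint just described (only the bare and www hosts moved; the helper name is illustrative):

from urllib.parse import urlsplit, urlunsplit

def migrate(url):
    parts = urlsplit(url)
    if parts.hostname in ("ancient.eu", "www.ancient.eu"):
        # Same path, query and fragment; only the host changes.
        return urlunsplit(("https", "www.worldhistory.org",
                           parts.path, parts.query, parts.fragment))
    return url  # e.g. etc.ancient.eu did not transfer, so leave it alone

assert migrate("https://www.ancient.eu/Rome/") == "https://www.worldhistory.org/Rome/"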

@Thamis: it is done. In addition to the URLs, it also changed/added |work= etc. to World History Encyclopedia. It got about 90%, but the string "Ancient History Encyclopedia" still exists in 89 pages/cites; those will require manual work to convert (the URLs are converted, only the string is not). They are mostly free-form cites with unusual formatting and would benefit from manual cleanup, ideally conversion to {{cite encyclopedia}}. -- GreenC 01:07, 24 April 2021 (UTC)

Results

  • Edited 759 articles
  • Converted 917 URLs (Example)

@GreenC: Thanks a lot for sorting this out! Greatly appreciated. :-) — Preceding unsigned comment added by Thamis (talk • contribs)

You are welcome. If you are looking for more ideas for improvement: converting everything to a cite template will make future maintenance easier and less error-prone. However, I would not recommend creating a custom template; custom templates are prone to breakage because tools need special custom code to handle them, whereas standard cite templates are better supported. -- GreenC 18:01, 6 May 2021 (UTC)

Remove oxfordjournals.org

Hello, I think all links to oxfordjournals.org subdomains in the url parameter of {{cite journal}} should be removed, as long as there's at least a doi, pmid, pmc, or hdl parameter set. Those links are all broken, because they redirect to an HTTPS version which uses a certificate valid only for silverchair.com (example: http://jah.oxfordjournals.org/content/99/1/24.full.pdf ).

The DOI redirects to the real target URL, which nowadays is somewhere in academic.oup.com, so there's no point in keeping or adding archived URLs or url-status parameters. These URLs have been broken for years already, so it's likely they will never be fixed. Nemo 07:13, 25 April 2021 (UTC)
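As a sketch of the proposed rule (assuming mwparserfromhell-style template objects; this is illustrative, not any bot's actual code):

PERMANENT_IDS = ("doi", "pmid", "pmc", "hdl")

def should_drop_url(template):
    url = str(template.get("url").value) if template.has("url") else ""
    if "oxfordjournals.org" not in url:
        return False
    # Drop the broken URL only when a permanent identifier can take over.
    return any(template.has(p) and str(template.get(p).value).strip()
               for p in PERMANENT_IDS)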

About 15,000. I have been admonished for removing archive URLs because of content drift, i.e. the page at the time of citation contains different content than the current one (academic.oup.com); the archive URL is therefore useful for showing the page as it was at the time of citation, for verification purposes. OTOH, if there is reason to believe content drift is not a concern for a particular domain, that is not my call to make; someone else would need to do that research and determine whether it should be a concern. @Nemo bis: -- GreenC 16:03, 25 April 2021 (UTC)
The "version of record" is the same, so the PDF at the new website should be identical to the old one. The PubMed Central copy is generally provided by the publisher, too. So the DOI and PMC ID, if present, eliminate any risk of content drift. On the other hand, I'm pretty sure whoever added those URLs didn't mean to cite a TLS error page. :) Nemo 18:21, 25 April 2021 (UTC)
I can do this, it will just need some time, thanks. -- GreenC

@Nemo bis: edited 20 articles: 1 2 3 4 5 - I forgot to remove |access-date= in a few cases. Do you see any other problems? -- GreenC 00:50, 6 May 2021 (UTC)

Looks good at first glance. I don't remember whether Citoid or Citation bot can extract the DOI from the HTML in later stages, once they can fetch the HTML from the Wayback Machine, but either way it's good to have it. Nemo 06:01, 6 May 2021 (UTC)

@GreenC and Nemo bis: Just saw an edit about this, but the links seem to work fine now? Thanks. Mike Peel (talk) 18:39, 7 May 2021 (UTC)

What example link is working for you? -- GreenC 18:47, 7 May 2021 (UTC)
@GreenC: I tried the example link above, and the one I reverted at [8] (I assume you got a notification about that?). They both redirect fine. Thanks. Mike Peel (talk) 18:54, 7 May 2021 (UTC)
I don't know what is happening. The message Nemo and I got was:
Firefox does not trust this site because it uses a certificate that is not valid for jah.oxfordjournals.org. The certificate is only valid for the following names: *.silverchair.com, silverchair.com, gsw.contentapi.silverchair.com, dup.contentapi.silverchair.com - Error code: SSL_ERROR_BAD_CERT_DOMAIN
This is Windows 7, Firefox 88.0.1 - when tried with Chrome it works after going through a captcha of the "click all squares with a bus" type 4 or 5 times, then it goes through to the content. Nemo, are you also using Firefox on Windows? -- GreenC 20:25, 7 May 2021 (UTC)
@GreenC: I'm using Firefox on a Mac. Please could you stop the edits until we can figure out what's going on? Thanks. Mike Peel (talk) 20:27, 7 May 2021 (UTC)
Done. -- GreenC 20:28, 7 May 2021 (UTC)
Mike, doesn't the http://jah.oxfordjournals.org/content/99/1/24.full.pdf URL redirect to https://jah.oxfordjournals.org/content/99/1/24.full.pdf and give a TLS error to you? Nemo 20:32, 7 May 2021 (UTC)
I get a PDF. The download link does start with watermark.silverchair.com though. Thanks. Mike Peel (talk) 20:35, 7 May 2021 (UTC)
Have you tried with another browser? Are you sure you haven't allowed that domain to bypass TLS security? Nemo 20:40, 7 May 2021 (UTC)
See my comment at the end of this section. I might have added an exception, since I use journal articles a lot, but I think that should only affect one browser. Have you tried doing that? Thanks. Mike Peel (talk) 20:42, 7 May 2021 (UTC)
(edit conflict) I guess that's a rare case of a domain that wasn't broken, but all the usual subdomains for journals are broken. That edit was fine anyway because the new link (to doi.org) goes to the same place and is more stable. We don't know how long the legacy OUP domains will even exist at all. Nemo 20:30, 7 May 2021 (UTC)

@Nemo bis: The first pass is done, with some problems. There are cases of non-{{cite journal}} templates that contain DOIs etc. Example. The bot was programmed for {{cite journal}} + aliases only, and I missed {{vcite journal}}. [9] There are cases of {{doi}} that it's not set up to detect [10]. There were 1,750 archive URLs added, so these problems would be in that group, though most of them are fine. -- GreenC 18:45, 7 May 2021 (UTC)

Nice bot cooperation! When the URL is removed, doi-access=free can do its job properly. Direct links to PDFs on Wayback are nice; links to archive.today, which only serve me a captcha, I'm less sure about. I see we still have 4000 articles with oxfordjournals.org; we can probably reduce that.
The {{doi}} cases we can't do much about; we need to wait for Citation bot or others to transform them into structured citations. Same for the non-standard templates: sometimes the people who use them are very opinionated.
I think the easiest win now is to replace some of the most commonly used citations, which are often the result of mass-creation of articles about species a decade ago. For instance a replacement similar to this would help some 300 articles:
[http://mollus.oxfordjournals.org/content/77/3/273.full  Bouchet P., Kantor Yu.I., Sysoev A. & Puillandre N. (2011) A new operational classification of the Conoidea. Journal of Molluscan Studies 77: 273–308.]
→
{{cite journal|first1=P.|last1=Bouchet|first2=Y. I.|last2=Kantor|first3=A.|last3=Sysoev|first4=N.|last4=Puillandre|title=A new operational classification of the Conoidea (Gastropoda)|journal=Journal of Molluscan Studies|date=1 August 2011|pages=273–308|volume=77|issue=3|doi=10.1093/mollus/eyr017|url=https://archimer.ifremer.fr/doc/00144/25544/23686.pdf}}
(You can probably match any text between two ref tags, or between a bullet and a newline, which matches /content/77/3/273.) I just converted the DOI to {{cite journal}} syntax with VisualEditor/Citoid and added what OAbot would have done. There are a few cases like this; you can probably find them from the IABot database or from a query of the remaining links. These are the most common IDs among them:
$ curl -s https://quarry.wmflabs.org/run/550945/output/1/tsv | grep -Eo "(/[0-9]+){3}" | sort | uniq -c | sort -nr | head
69 /21/7/1361
60 /22/10/1964
51 /77/3/273
29 /24/6/1300
28 /19/7/1008
25 /24/20/2339
21 /55/6/912
17 /22/2/189
16 /19/1/2
15 /11/3/257
Nemo 20:30, 7 May 2021 (UTC)
Testing [11] in Safari, Chrome, and Firefox on a Mac, no problems, it redirects fine... Thanks. Mike Peel (talk) 20:32, 7 May 2021 (UTC)
We can definitely replicate the problem on two computers (my own and Nemo's), so there is a good chance it is happening for others. There is also the question of which is better: with the URL or without? With the URL (assuming you get through) it asks for a captcha which is somewhat difficult to get past, and it's a link to a vendor-specific site. Without the URL it goes to doi.org - long-term reliable - and opens the PDF without a captcha or potential SSL problems. Comparing before and after deletion, the citation has been improved IMO. -- GreenC 20:47, 7 May 2021 (UTC)
Do you have HTTPS Everywhere? I see that http://mollus.oxfordjournals.org/content/77/3/273.full redirects directly to https://academic.oup.com/mollus/article/77/3/273/1211552 without it, but if the redirect is to https://mollus.oxfordjournals.org/content/77/3/273.full then nothing works, because that URL is served incorrectly.
Anyway, this was just one of many issues with those old oxfordjournals.org URLs: there are also pmid URLs which don't go anywhere, URLs which redirect to the mainpage of the respective journal and so on. When we have a DOI there's no reason to keep them, they're ticking bombs even if they just happen to work for now. Nemo 20:53, 7 May 2021 (UTC)
I do have HTTPS Everywhere; turning it off got me through (to a captcha). That should not happen. It would be an improvement to replace these with DOI URLs when available. -- GreenC 21:14, 7 May 2021 (UTC)
Ok, I've sent a patch for the ruleset. Nevertheless I recommend proceeding with the cleanup, because we're never going to be able to babysit the fate of 390 legacy domains. I'm listing at User:Nemo bis/Sandbox some suggestions for more specific replacements. (Some URLs need to be searched in all their variants, especially that first one, "77/3/273".) Nemo 21:38, 7 May 2021 (UTC)
(edit conflict) GreenC, would you kindly stop your bot from doing this? You are removing working links for no reason. Here you removed http://jhered.oxfordjournals.org/content/30/12/549.extract, but that link works just fine, and redirects effortlessly to https://academic.oup.com/jhered/article-abstract/30/12/549/911170. If it isn't broken there's (really!) no need to mend it (and even less to break it). If you want to replace the old link with the new one that's fine with me (I've already done a few), but please stop removing working links. Thanks, Justlettersandnumbers (talk) 21:44, 7 May 2021 (UTC)
Well that URL is broken for a few million users at the moment, so there is a reason to remove it. One alternative is to replace it with a doi.org URL if there is no doi-access=free or PMC parameter yet. Nemo 21:57, 7 May 2021 (UTC)

Have you tried contacting the journal about these issues? Since the links *do* work (possibly unless you apply extra restrictions), I don't think these removals should be happening without asking the wider community first. Thanks. Mike Peel (talk) 08:28, 8 May 2021 (UTC)

OUP is notoriously impervious to pleas that they fix URLs or even DOIs. There's no point trying. Nemo 09:55, 8 May 2021 (UTC)

Dead links redundant with permanent links

Related to #Fix pdfs.semanticscholar.org links, or rather the work that followed it at phabricator:T281631: there are a few hundred {{dead link}} notices which can be removed (together with the associated URL), because the DOI or HDL can be expected to provide the canonical permanent link. See a simple search at:

This is not nearly as urgent as the OUP issue above, and if it's complicated I may also do it manually, but it seems big enough to benefit from a bot run at some point. Nemo 16:26, 5 May 2021 (UTC)

To confirm: if a cite template contains |doi-access=free or |hdl-access=free and has a {{dead link}} attached, remove the {{dead link}} (plus {{cbignore}}) and the |url=. -- GreenC 20:11, 5 May 2021 (UTC)
Yes. Also a pmc. Nemo 20:58, 5 May 2021 (UTC)
|pmid= ? -- GreenC 18:03, 6 May 2021 (UTC)
IMHO not, because the PMID alone doesn't provide the full text, so the original URL might have had something different. The reason PMID is sufficient with the OUP links above is that PubMed links to the same publisher landing page as the original URL. Nemo 05:54, 7 May 2021 (UTC)
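A minimal sketch of the rule as settled above (again assuming mwparserfromhell-style template objects; the helper name is made up):

def has_free_fulltext(template):
    # |doi-access=free or |hdl-access=free guarantees a free copy of record.
    for param in ("doi-access", "hdl-access"):
        if template.has(param) and str(template.get(param).value).strip() == "free":
            return True
    # A PMC ID also provides the full text; a bare PMID does not, so it is excluded.
    return template.has("pmc") and bool(str(template.get("pmc").value).strip())

When this returns True and a {{dead link}} is attached, the bot would remove the |url=, the {{dead link}} and any {{cbignore}}.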

SR/Olympics templates

Hello. As SR/Olympics has been shut down, several SR/Olympics templates are broken. They are Template:SR/Olympics country at games (250 usages), Template:SR/Olympics sport at games and Template:SR/Olympics sport at games/url (both 63 usages). See for example Algeria at the 2012 Summer Olympics and Football at the 2012 Summer Olympics. I'm not sure if InternetArchiveBot can work with these templates. I was wondering how these links could be fixed with archived URLs, like at Template:Sports reference. Thanks! --MrLinkinPark333 (talk) 19:35, 10 May 2021 (UTC)

The first two already have an |archive= argument, so it's just a matter of updating each instance with a 14-digit timestamp, e.g. |archive=20161204010101. The last one is used by the second one, which is why it has the same count; nothing to do there. For the first two, I guess it would require some custom code to find a working timestamp and add it. This is why I dislike custom templates: they don't work with standard tools, and each instance is a custom programming job. I'll see what I can do. -- GreenC 20:04, 10 May 2021 (UTC)
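A sketch of the lookup step that such custom code would need, assuming the Wayback Machine availability API (the function name is illustrative; the sports-reference.com URL would be reconstructed from the template's arguments):

import requests

def find_archive_timestamp(url):
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url}).json()
    closest = resp.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["timestamp"]  # 14 digits, e.g. "20161204010101"
    return None  # no working snapshot; leave that template instance alone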

Reuters

The new Reuters website redirected all subdomains to www.reuters.com and broke all links. That's about 50k articles on the English Wikipedia alone, I believe. I see that the domain is whitelisted on InternetArchiveBot, not sure whether that's intended. Nemo 20:13, 1 May 2021 (UTC)

Wow, that's major. Domains can become auto-whitelisted if the bot receives confusing signals by way of user reverts (of the bot). It looks like some subdomains still work [12]. Others correctly return 404 and would be picked up by IABot - except for the whitelist [13]. Others are soft-404ing [14]. How to determine a soft 404 is an art; in this case it's easy enough - the URL redirects to a page with the title "Homepage" - but there are probably other, unknown landing locations. WaybackMedic should be able to do this; it has good code for following redirects, checking headers and verifying (known) soft 404s. I will not be able to start for at least a week, to catch up on other things, and then it will take a while due to the size. -- GreenC 21:59, 1 May 2021 (UTC)
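For illustration, a hedged sketch of such a soft-404 check (follow redirects, then inspect the landing page); the "Homepage" title test is the one heuristic identified above, and a real check would need rules for the other landing locations:

import re
import requests

def is_soft404(url):
    resp = requests.get(url, timeout=30, allow_redirects=True)
    if resp.status_code >= 400:
        return False  # hard-dead; IABot can already detect this case
    match = re.search(r"<title>(.*?)</title>", resp.text, re.I | re.S)
    return bool(match) and match.group(1).strip().lower() == "homepage"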

Thanks. I count 249299 links to reuters.com across ~all wikis at the moment (phabricator:P15671). Nemo 08:06, 2 May 2021 (UTC)
Interesting how spread out they are, except enwiki. There is probably a rule similar to 80/20, but here more like 40/60 or 33/66 (enwiki/everything else). -- GreenC 15:13, 2 May 2021 (UTC)

Reuters is complete. A typical entry in the IABot database: [15], i.e. GreenC bot detected that the URL is dead and set it to blacklisted. A different case: [16] - set blacklisted, and add an archive URL when available. A third type removes the archive URL if it is not working and there is no replacement. In total it checked 165k unique URLs and edited about 69k, so about 42% are now blacklisted; the rest still work (Example). The next step would be to run IABot on all pages with a Reuters URL (any hostname, both .com and .co.uk) on all supported wiki language sites; or IABot will find them in time. -- GreenC 01:52, 21 May 2021 (UTC)

Broken amazonaws links

I frequently find "expired" links to amazonaws on Wikipedia: is it possible to automatically repair them? Jarble (talk) 18:31, 21 May 2021 (UTC)

@Jarble: The search shows around 370. The few I looked at manually don't have an archive available. For example, [17] from Sardinian language contains Expires=1612051241, which appears to be a Unix timestamp and converts to January 2021. And sure enough, the archive post-dates it (it's from April) and doesn't work, returning a 4xx code. So it would need a {{dead link}}. In fact, even a link that has not expired yet probably will soon enough, so it should be treated as "dead" and an archive found and added ASAP. We currently have no mechanism for automating search-and-archive processes on certain URLs on a recurring basis. It wouldn't be difficult to process these 370, but for the new ones added going forward, I'll need to think about it. -- GreenC 19:02, 21 May 2021 (UTC)
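The timestamp conversion is easy to check in code; a short sketch (the Expires= value is a plain Unix epoch time in the query string):

from datetime import datetime, timezone
from urllib.parse import parse_qs, urlsplit

def aws_expiry(url):
    qs = parse_qs(urlsplit(url).query)
    return datetime.fromtimestamp(int(qs["Expires"][0]), tz=timezone.utc)

# Expires=1612051241 -> 2021-01-31 00:00:41 UTC, i.e. January 2021 as above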

Ok, I created a bot that will search for the URLs daily and, if it finds a URL not seen before, issue a Save Page Now at Wayback. This should ensure any new additions are immediately archived - assuming Wayback is even capable of archiving them. The bot has public logs at https://tools-static.wmflabs.org/botwikiawk/awsexp/ and the source is there also. Once the URLs are archived, it's just a matter of IABot or some other process adding the archives into the wiki at our leisure, regardless of whether the expiration date has passed. -- GreenC 20:26, 21 May 2021 (UTC)
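A sketch of the archiving step, assuming the simple unauthenticated Save Page Now endpoint (the real bot's source is in the logs directory linked above and may use the authenticated SPN2 API instead):

import requests

def save_page_now(url):
    # Ask Wayback to crawl and capture the target URL right now.
    resp = requests.get("https://web.archive.org/save/" + url,
                        timeout=120, allow_redirects=True)
    return resp.status_code  # 200 indicates the capture was accepted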

An AWS link was added at 12:31 pm, May 17, 2021 with an &Expires= timestamp of 13:30:53, about 1 hour later. Holy cow, that is a short expiration period. No wonder these links are broken and have no archive URLs available. The bot would need to run every 50 minutes or so, assuming there are none with even shorter expirations. -- GreenC 21:17, 21 May 2021 (UTC)

The bot now runs twice an hour. Keeping an eye on the logs, to see if it can successfully save the page before it expires. -- GreenC 21:53, 21 May 2021 (UTC)
Another came through; it was moved into mainspace from Draft, so it had already expired. Added Draft: and File: to the search. -- GreenC 02:59, 22 May 2021 (UTC)

Jarble, I've processed the links in 584 articles and converted most of them to {{dead link}}, or added a working archive.org snapshot -- it was difficult because my bot is designed to find URLs by a specific domain name, not by a keyword in the path (&Expires=), and here the domain could be anything. As such, there were 31 cases it could not determine, and rather than debug and test why, it will be faster to fix them manually. I have too much other work to do, and so am moving on. In case you want to fix them, they are in the following articles:

-- GreenC 19:57, 24 May 2021 (UTC)

Estradiol hormones - External link not working - Dead link

The article about estradiol as a hormone contains a non-working external link in the references. Reference number 71, in its last external link (a PDF), cites the values from this source: "Establishment of detailed reference values for luteinizing hormone, follicle stimulating hormone, estradiol, and progesterone during different phases of the menstrual cycle on the Abbott ARCHITECT analyzer".

This external link redirects to a 404 server error and needs to be replaced with a working link. The original research document is available on the laboratory's website.

How can I change this link? I don't know how to use a bot. I'm thankful for any help. — Preceding unsigned comment added by Jerome.lab (talk • contribs) 13:11, 30 April 2021 (UTC)

Instructions are at WP:URLREQ. Primefac (talk) 13:18, 30 April 2021 (UTC)
Moved from talk. 23:23, 24 May 2021 (UTC)

Fixed. You can find archive URLs at archive.org and replace them in the article when the link is dead. -- GreenC 00:53, 25 May 2021 (UTC)