Spider trap
A spider trap (or crawler trap) is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash.[1]
Common techniques used are:
- Creation of indefinitely deep directory structures, such as
http://example.com/bar/foo/bar/foo/bar/foo/bar/...
- Dynamic pages that produce an unbounded number of documents for a web crawler to follow. Examples include calendars and algorithmically generated language poetry.[2] (A minimal sketch of a calendar trap follows this list.)
- Documents filled with a large number of characters, crashing the lexical analyzer parsing the document.
- Documents with session IDs based on required cookies.
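The calendar case can be made concrete with a short sketch. The following hypothetical handler (the URL scheme and all names are assumptions for illustration, not taken from any real site) serves a page for every date, each linking to the next day, so a crawler that follows links naively never exhausts the site:

```python
# Minimal sketch of an unintentional calendar-style spider trap, using only
# the Python standard library. Every generated page links to the "next" day,
# so the link graph is infinite even though no single page is malicious.
import datetime
import re
from http.server import BaseHTTPRequestHandler, HTTPServer

class CalendarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect paths such as /calendar/2019-10-16; default to today.
        match = re.fullmatch(r"/calendar/(\d{4}-\d{2}-\d{2})", self.path)
        try:
            day = (datetime.date.fromisoformat(match.group(1))
                   if match else datetime.date.today())
        except ValueError:
            day = datetime.date.today()
        nxt = day + datetime.timedelta(days=1)
        body = (f"<html><body><h1>{day}</h1>"
                f'<a href="/calendar/{nxt}">next day</a>'
                f"</body></html>").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CalendarHandler).serve_forever()
```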
There is no algorithm to detect all spider traps. Some classes of traps can be detected automatically, but new, unrecognized traps arise quickly.
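As a sketch of what such automatic detection can look like, the following URL-level heuristics flag two of the trap classes above: excessive path depth and cyclically repeating path segments. The thresholds and the function name are assumptions chosen for illustration, not a standard:

```python
# Hedged sketch of two URL-level trap heuristics. Thresholds are assumed.
from urllib.parse import urlparse

MAX_DEPTH = 16     # assumed limit on path depth
MAX_REPEATS = 3    # assumed limit on back-to-back repeats of a segment pair

def looks_like_trap(url: str) -> bool:
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > MAX_DEPTH:
        return True
    # Count how often the trailing pair of segments repeats back-to-back,
    # catching cycles such as /bar/foo/bar/foo/...
    if len(segments) >= 2:
        pair = segments[-2:]
        repeats = 0
        i = len(segments) - 2
        while i >= 0 and segments[i:i + 2] == pair:
            repeats += 1
            i -= 2
        if repeats >= MAX_REPEATS:
            return True
    return False

assert looks_like_trap("http://example.com/bar/foo/bar/foo/bar/foo/bar/foo")
assert not looks_like_trap("http://example.com/docs/api/index.html")
```

Heuristics like these produce false positives (some legitimate sites have deep or repetitive paths), which is one reason no single check catches every trap.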
Politeness
A spider trap causes a web crawler to enter something like an infinite loop,[3] which wastes the spider's resources,[4] lowers its productivity, and, in the case of a poorly written crawler, can crash the program. Polite spiders alternate requests between different hosts, and do not request documents from the same server more than once every several seconds,[5] meaning that a "polite" web crawler is affected to a much lesser degree than an "impolite" crawler.[citation needed]
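A minimal sketch of the per-host delay described above, assuming a single-threaded crawler (the five-second interval and the helper name are illustrative, not a fixed standard):

```python
# Sketch of the per-host politeness delay: remember when each host was last
# contacted and wait until MIN_INTERVAL has elapsed before hitting it again.
import time
from urllib.parse import urlparse

MIN_INTERVAL = 5.0   # assumed minimum seconds between requests to one host
_last_hit = {}       # host -> time.monotonic() of the last request

def wait_politely(url):
    """Block until it is polite to request the given URL's host again."""
    host = urlparse(url).netloc
    elapsed = time.monotonic() - _last_hit.get(host, float("-inf"))
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_hit[host] = time.monotonic()
```

A crawler would call wait_politely(url) immediately before each fetch; interleaving URLs from different hosts in the frontier keeps overall throughput high while each individual server sees only a slow, polite client, which bounds the damage a trap on any one host can do.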
In addition, sites with spider traps usually have a robots.txt file that tells robots not to enter the trap, so a legitimate "polite" robot does not fall into the trap, whereas an "impolite" robot that disregards the robots.txt settings is affected by it.[6]
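For example, a site could fence off the infinite /bar/foo/... directory trap above with a robots.txt rule, and a polite crawler could honor it using Python's standard urllib.robotparser; the rule itself is a hypothetical configuration:

```python
# Sketch of a hypothetical robots.txt that fences off the /bar/ trap,
# checked with Python's standard-library robots.txt parser.
from urllib import robotparser

ROBOTS_TXT = """\
User-agent: *
Disallow: /bar/
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks every candidate URL before fetching it.
print(rp.can_fetch("*", "http://example.com/bar/foo/bar/foo/"))  # False: trap is off-limits
print(rp.can_fetch("*", "http://example.com/index.html"))        # True: normal page
```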
See also
- Robots exclusion standard
- Web crawler
References
- ^ ""What is a Spider Trap?"". Techopedia. 27 November 2017. Retrieved 2018-05-29.
- ^ Neil M Hennessy. "The Sweetest Poison, or The Discovery of L=A=N=G=U=A=G=E Poetry on the Web". Retrieved 2013-09-26.
- ^ "Portent". Portent. 2016-02-03. Retrieved 2019-10-16.
- ^ "How to Set Up a robots.txt to Control Search Engine Spiders (thesitewizard.com)". www.thesitewizard.com. Retrieved 2019-10-16.
- ^ "Building a Polite Web Crawler". The DEV Community. 13 April 2019. Retrieved 2019-10-16.
- ^ J Media Group (2017-10-12). "Closing a spider trap: fix crawl inefficiencies". Retrieved 2019-10-16.