Search engine (computing)
It has been suggested that this article be merged into Search engine. (Discuss) Proposed since September 2023. |
This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these template messages)
|
In
A search engine normally consists of four components, as follows: a search interface, a crawler (also known as a spider or bot), an indexer, and a database. The crawler traverses a document collection, deconstructs document text, and assigns surrogates for storage in the search engine index. Online search engines store images, link data and metadata for the document.
How search engines work
Search engines provide an
The list of items that meet the criteria specified by the query is typically sorted, or ranked. Ranking items by relevance (from highest to lowest) reduces the time required to find the desired information.
To provide a set of matching items that are sorted according to some criteria quickly, a search engine will typically collect
Other types of search engines do not store an index.
Database size, which had been a significant marketing feature through the early 2000s, was similarly displaced by emphasis on relevancy ranking, the methods by which search engines attempt to sort the best results first. Relevancy ranking first became a major issue c. 1996, when it became apparent that it was impractical to review full lists of results. Consequently,
Search engine experience for users continues to be enhanced. Google's addition of the
Search engine categories
Web search engines
Search engines that are expressly designed for searching web pages, documents, and images were developed to facilitate searching through a large, nebulous blob of unstructured resources. They are engineered to follow a multi-stage process: crawling the infinite stockpile of pages and documents to skim the figurative foam from their contents, indexing the foam/buzzwords in a sort of semi-structured form (database or something), and at last, resolving user entries/queries to return mostly relevant results and links to those skimmed documents or pages from the inventory.
Crawl
In the case of a wholly textual search, the first step in classifying web pages is to find an ‘index item’ that might relate expressly to the ‘search term.’ In the past, search engines began with a small list of URLs as a so-called seed list, fetched the content, and parsed the links on those pages for relevant information, which subsequently provided new links. The process was highly cyclical and continued until enough pages were found for the searcher's use. These days, a continuous crawl method is employed as opposed to an incidental discovery based on a seed list. The crawl method is an extension of aforementioned discovery method.
Most search engines use sophisticated scheduling algorithms to “decide” when to revisit a particular page, to appeal to its relevance. These algorithms range from constant visit-interval with higher priority for more frequently changing pages to adaptive visit-interval based on several criteria such as frequency of change, popularity, and overall quality of site. The speed of the web server running the page as well as resource constraints like amount of hardware or bandwidth also figure in.
Link map
Pages that are discovered by web crawls are often distributed and fed into another computer that creates a map of resources uncovered. The bunchy clustermass looks a little like a graph, on which the different pages are represented as small nodes that are connected by links between the pages. The excess of data is stored in multiple data structures that permit quick access to said data by certain algorithms that compute the popularity score of pages on the web based on how many links point to a certain web page, which is how people can access any number of resources concerned with diagnosing psychosis. Another example would be the accessibility/rank of web pages containing information on Mohamed Morsi versus the very best attractions to visit in Cairo after simply entering ‘Egypt’ as a search term. One such algorithm, PageRank, proposed by Google founders Larry Page and Sergey Brin, is well known and has attracted a lot of attention because it highlights repeat mundanity of web searches courtesy of students that don't know how to properly research subjects on Google.
The idea of doing link analysis to compute a popularity rank is older than PageRank. However, In October 2014, Google’s John Mueller confirmed that Google is not going to be updating it (Page Rank) going forward. Other variants of the same idea are currently in use – grade schoolers do the same sort of computations in picking kickball teams. These ideas can be categorized into three main categories: rank of individual pages and nature of web site content. Search engines often differentiate between internal links and external links, because web content creators are not strangers to shameless self-promotion. Link map data structures typically store the anchor text embedded in the links as well, because anchor text can often provide a “very good quality” summary of a web page's content.
Database Search Engines
Searching for text-based content in databases presents a few special challenges from which a number of specialized search engines flourish. Databases can be slow when solving complex queries (with multiple logical or string matching arguments). Databases allow pseudo-logical queries which full-text searches do not use. There is no crawling necessary for a database since the data is already structured. However, it is often necessary to index the data in a more economized form to allow a more expeditious search.
Mixed Search Engines
Sometimes, data searched contains both database content and web pages or documents. Search engine technology has developed to respond to both sets of requirements. Most mixed search engines are large Web search engines, like Google. They search both through structured and unstructured data sources. Take for example, the word ‘ball.’ In its simplest terms, it returns more than 40 variations on Wikipedia alone. Did you mean a ball, as in the social gathering/dance? A soccer ball? The ball of the foot? Pages and documents are crawled and indexed in a separate index. Databases are indexed also from various sources. Search results are then generated for users by querying these multiple indices in parallel and compounding the results according to “rules.”
History of Search Technology
The Memex
This section may contain an excessive amount of intricate detail that may interest only a particular audience. Specifically, hypertext-based search ranking applies to a small subset of search technology; details about the history and implementation of Memex are not relevant.. |
The concept of hypertext and a memory extension originates from an article that was published in
Bush regarded the notion of “associative indexing” as his key conceptual contribution. As he explained, this was “a provision whereby any item may be caused at will to select immediately and automatically another. This is the essential feature of the memex. The process of tying two items together is the important thing.[7]
All of the documents used in the memex would be in the form of microfilm copy acquired as such or, in the case of personal records, transformed to microfilm by the machine itself. Memex would also employ new retrieval techniques based on a new kind of associative indexing the basic idea of which is a provision whereby any item may be caused at will to select immediately and automatically another to create personal "trails" through linked documents. The new procedures, that Bush anticipated facilitating information storage and retrieval would lead to the development of wholly new forms of the encyclopedia.
The most important mechanism, conceived by Bush, is the associative trail. It would be a way to create a new linear sequence of microfilm frames across any arbitrary sequence of microfilm frames by creating a chained sequence of links in the way just described, along with personal comments and side trails.
In 1965 Bush took part in the project INTREX of MIT, for developing technology for mechanization the processing of information for library use. In his 1967 essay titled "Memex Revisited", he pointed out that the development of the digital computer, the transistor, the video, and other similar devices had heightened the feasibility of such mechanization, but costs would delay its achievements.[8]
SMART
Gerard Salton, who died on August 28 of 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART informational retrieval system. Salton's Magic Automatic Retriever of Text included important concepts like the
He authored a 56-page book called A Theory of Indexing which explained many of his tests, upon which search is still largely based.
String Search Engines
In 1987 an article was published detailing the development of a character string search engine (SSE) for rapid text retrieval on a double-metal 1.6-μm n-well CMOS solid-state circuit with 217,600 transistors lain out on a 8.62x12.76-mm die area. The SSE accommodated a novel string-search architecture which combines a 512-stage finite-state automaton (FSA) logic with a content addressable memory (CAM) to achieve an approximate string comparison of 80 million strings per second. The CAM cell consisted of four conventional static RAM (SRAM) cells and a read/write circuit. Concurrent comparison of 64 stored strings with variable length was achieved in 50 ns for an input text stream of 10 million characters/s, permitting performance despite the presence of single character errors in the form of character codes. Furthermore, the chip allowed nonanchor string search and variable-length `don't care' (VLDC) string search.[9]
See also
By source
- Database search engine
- Desktop search
- Enterprise search
- Federated search
- Human search engine
- Metasearch engine
- Multisearch
- Search aggregator
- Web search engine
By content type
- Audio search engine
- Full text search
- Image search
- Video search engine
By interface
By topic
Others
- Automatic summarization
- Emanuel Goldberg (inventor of early search engine)
- Index (search engine)
- Inverted index
- List of search engines
- Search as a service
- Search engine indexing
- Search engine optimization
- Search suggest drop-down list
- Solver (computer science)
- Spamdexing
- SQL
- Text mining
- Web crawler
- Word-sense disambiguation (dealing with Ambiguity)
References
- ^ "The Seven Ages of Information there are may many ways Retrieval". Retrieved 1 June 2014.
- ^ Voorhees, E.M. Natural Language Processing and Information Retrieval[permanent dead link]. National Institute of Standards and Technology. March 2000.
- ^ "Internet Basics: Using Search Engines". GCFGlobal.org. Retrieved 2022-07-11.
- ISBN 978-1-4165-4696-2. Retrieved 9 December 2012.
- ^ "What do we make of Wikipedia's falling traffic?". The Daily Dot. 2014-01-08. Retrieved 2020-11-01.
- S2CID 2378301.
- S2CID 2378301The example Bush gives is a quest to find information on the relative merits of the Turkish short bow and the English long bow in the crusades)
{{cite journal}}
: CS1 maint: postscript (link - ^ "The MEMEX of Vannevar Bush". 4 January 2021.
- .