Document retrieval
Document retrieval is defined as the matching of some stated user query against a set of text records (documents).
Document retrieval is sometimes referred to as, or as a branch of, text retrieval. Text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text databases became decentralized thanks to the personal computer. Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines.
Description
Document retrieval systems find information that meets given criteria by matching text records (documents) against user queries.
A document retrieval system has two main tasks:
- Find relevant documents to user queries
- Evaluate the matching results and sort them according to relevance, using algorithms such as PageRank.
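The two tasks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenizer, the `retrieve` function, and the sample documents are hypothetical, and the score is a raw count of shared terms, which stands in for a real relevance algorithm (it is not PageRank).

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()


def retrieve(query, documents):
    """Return (doc_id, score) pairs for matching documents, best first."""
    query_terms = set(tokenize(query))
    results = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        # Task 1: a document matches if it contains at least one query term.
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            results.append((doc_id, score))
    # Task 2: sort by descending score; ties are broken by doc_id.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical sample collection.
docs = {
    "d1": "suffix trees index every substring",
    "d2": "an inverted index maps terms to documents",
    "d3": "signature files filter documents before a full index lookup",
}
ranked = retrieve("index documents", docs)
```

Here `ranked` lists `d2` and `d3` (two matching terms each) ahead of `d1` (one matching term). A production system would replace the counting step with a proper weighting or link-analysis scheme.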
Variations
There are two main classes of indexing schemata for document retrieval systems: form based (or word based), and content based indexing. The document classification scheme (or indexing algorithm) in use determines the nature of the document retrieval system.
Form based
Form based document retrieval addresses the exact syntactic properties of a text, comparable to substring matching in string searches. The text is generally unstructured and not necessarily in a natural language; the system could, for example, be used to process large sets of chemical representations in molecular biology. A suffix tree algorithm is an example of form based indexing.
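As a rough illustration of form based indexing, the following Python sketch indexes a string so that any substring can be located by binary search. It uses a sorted list of suffixes (a naive suffix array) rather than a suffix tree, because that is shorter to write; the function names and the sample sequence are assumptions made for this example.

```python
import bisect


def build_suffix_array(text):
    """Return every suffix of text as a sorted list of (suffix, start) pairs.

    Full suffixes are stored, so memory use is quadratic in the text length.
    That is acceptable for a toy example but not for large collections.
    """
    return sorted((text[i:], i) for i in range(len(text)))


def find_occurrences(suffixes, pattern):
    """Return the sorted start positions at which pattern occurs."""
    # Suffixes that begin with pattern form one contiguous run in sorted order.
    lo = bisect.bisect_left(suffixes, (pattern, -1))
    positions = []
    for suffix, start in suffixes[lo:]:
        if not suffix.startswith(pattern):
            break
        positions.append(start)
    return sorted(positions)


# Hypothetical sequence data; the index does not assume natural language.
sequence_index = build_suffix_array("GATTACAGATTA")
hits = find_occurrences(sequence_index, "ATTA")
```

In this example `hits` is `[1, 8]`, the two positions where `ATTA` begins. A suffix tree answers the same kind of query with better asymptotic behaviour.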
Content based
The content based approach exploits semantic connections between documents and parts thereof, and semantic connections between queries and documents. Most content based document retrieval systems use an inverted index algorithm.
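The inverted index data structure itself can be sketched as a mapping from each term to the set of documents that contain it. The sketch below is a minimal Python version under simple assumptions: whitespace tokenization, no stemming or weighting, and hypothetical function names and sample documents. It shows only the lookup structure, not the semantic analysis a content based system would layer on top.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # .get avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample collection.
abstracts = {
    "d1": "protein folding in yeast",
    "d2": "yeast genome sequencing",
    "d3": "protein structure prediction",
}
inverted = build_inverted_index(abstracts)
both = search_all_terms(inverted, "protein yeast")
```

Here `both` contains only `d1`, the one document in which both terms appear. Query time depends on the lengths of the posting sets rather than on the size of the whole collection, which is the main appeal of the structure.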
A signature file is a technique that creates a quick and dirty filter: it keeps all the documents that match the query and, ideally, only a few that do not.
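The filter-then-verify idea behind signature files can be sketched in Python using superimposed coding, where each word sets a few bits in a fixed-width bit string and a document's signature is the bitwise OR of its word signatures. The signature width, the number of bits per word, the use of SHA-256 as the hash, and all function names are assumptions chosen for this illustration.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bits set per word


def word_signature(word):
    """Set BITS_PER_WORD pseudo-random bit positions for a word."""
    sig = 0
    for k in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        sig |= 1 << position
    return sig


def text_signature(text):
    """OR together the signatures of all words in the text."""
    sig = 0
    for word in text.lower().split():
        sig |= word_signature(word)
    return sig


def candidates(signatures, query):
    """Cheap filter: keep documents whose signature covers the query's bits."""
    q = text_signature(query)
    return {doc_id for doc_id, sig in signatures.items() if (sig & q) == q}


def search(documents, signatures, query):
    """Verify each candidate against the real text to drop false positives."""
    terms = set(query.lower().split())
    return {
        doc_id
        for doc_id in candidates(signatures, query)
        if terms <= set(documents[doc_id].lower().split())
    }


# Hypothetical sample collection.
records = {
    "d1": "protein folding in yeast",
    "d2": "yeast genome sequencing",
    "d3": "protein structure prediction",
}
record_signatures = {doc_id: text_signature(t) for doc_id, t in records.items()}
verified = search(records, record_signatures, "yeast protein")
```

The filter can never miss a true match, because every bit set by a query word is also set in the signature of any document containing that word. It may let through documents that do not match, which is why the verification pass is needed.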
Example: PubMed
The PubMed[1] form interface features the "related articles" search which works through a comparison of words from the documents' title, abstract, and MeSH terms using a word-weighted algorithm.[2][3]
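The general idea of a word-weighted comparison can be illustrated with a small Python sketch. This is not PubMed's actual algorithm: the inverse-document-frequency style weights, the function names, and the sample articles are all assumptions made for the example. Each article is reduced to a list of words (standing in for words from its title, abstract, and MeSH terms), and two articles are scored by the summed weights of the words they share.

```python
import math
from collections import Counter


def word_weights(articles):
    """Give rarer words higher weight (an assumed IDF-style scheme)."""
    n = len(articles)
    doc_freq = Counter()
    for words in articles.values():
        doc_freq.update(set(words))
    return {w: math.log(n / df) + 1.0 for w, df in doc_freq.items()}


def related(articles, source_id, weights):
    """Rank the other articles by the total weight of words shared with the source."""
    source = set(articles[source_id])
    scores = {}
    for other_id, words in articles.items():
        if other_id == source_id:
            continue
        shared = source & set(words)
        scores[other_id] = sum(weights[w] for w in shared)
    return sorted(scores, key=lambda i: (-scores[i], i))


# Hypothetical articles, already reduced to word lists.
articles = {
    "a": ["insulin", "receptor", "signaling"],
    "b": ["insulin", "receptor", "mutation"],
    "c": ["influenza", "vaccine", "signaling"],
}
weights = word_weights(articles)
neighbours = related(articles, "a", weights)
```

In this example article `b` ranks ahead of `c` as a neighbour of `a`, because it shares two weighted words with `a` rather than one.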
See also
- Compound term processing
- Document classification
- Enterprise search
- Evaluation measures (information retrieval)
- Full text search
- Information retrieval
- Latent semantic indexing
- Search engine
References
- PMID 11825203.
- Computation of Related Citations. National Center for Biotechnology Information (US). 2019-02-06.
- PMID 17971238.
Further reading
- Faloutsos, Christos; Christodoulakis, Stavros (1984). "Signature files: An access method for documents and its analytical performance evaluation". ACM Transactions on Information Systems. 2 (4): 267–288. S2CID 8120705.
- Justin Zobel; Alistair Moffat; Kotagiri Ramamohanarao (1998). "Inverted files versus signature files for text indexing" (PDF). ACM Transactions on Database Systems. 23 (4): 453–490. S2CID 7293918.
- Ben Carterette; Fazli Can (2005). "Comparing inverted files and signature files for searching a large lexicon" (PDF). Information Processing and Management. 41 (3): 613–633.
External links
- Formal Foundation of Information Retrieval, Buckinghamshire Chilterns University College