Stop word

Stop words are the words in a stop list (or stoplist or negative dictionary) that are filtered out (i.e., stopped) before or after processing of natural language data (text) because they are deemed insignificant.^[1] There is no single universal list of stop words used by all natural language processing tools, nor are there any agreed-upon rules for identifying stop words, and not all tools use such a list at all. Any group of words can therefore be chosen as the stop words for a given purpose. The "general trend in [information retrieval] systems over time has been from standard use of quite large stop lists (200–300 terms) to very small stop lists (7–12 terms) to no stop list whatsoever".^[2]
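As a rough illustration of the filtering step described above, the following Python sketch drops every token that appears in a stop list. The eight-word list, the tokenizer, and the function names are assumptions made for this example only; they do not correspond to any standard list or library.

```python
import re

# Hypothetical, deliberately tiny stop list; real lists differ by tool and purpose.
STOP_WORDS = frozenset({"a", "an", "and", "at", "is", "on", "the", "which"})


def tokenize(text):
    """Lowercase the text and return its runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving order."""
    return [token for token in tokens if token not in stop_words]


tokens = tokenize("The cat is on the mat")
print(remove_stop_words(tokens))  # ['cat', 'mat']
```

Passing a different set as stop_words changes what is filtered, which reflects the point that the list is a choice made for a given purpose rather than a fixed property of the language.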

History of stop words

A predecessor concept was used in creating some concordances. For example, the first Hebrew concordance, Isaac Nathan ben Kalonymus's Me’ir Nativ, contained a one-page list of unindexed words, with nonsubstantive prepositions and conjunctions which are similar to modern stop words.^[3]

Hans Peter Luhn, one of the pioneers in information retrieval, is credited with coining the phrase and using the concept when introducing his Keyword-in-Context automatic indexing process.^[4] The phrase "stop word", which is not in Luhn's 1959 presentation, and the associated terms "stop list" and "stoplist" appear in the literature shortly afterward.^[5]

Although it is commonly assumed that stoplists include only the most frequent words in a language, it was C.J. Van Rijsbergen who proposed the first standardized list which was not based on word frequency information. The "Van list" included 250 English words. Martin Porter's word stemming program developed in the 1980s built on the Van list, and the Porter list is now commonly used as a default stoplist in a variety of software applications.

In 1990, Christopher Fox proposed the first general stop list based on empirical word frequency information derived from the Brown Corpus:

This paper reports an exercise in generating a stop list for general text based on the Brown corpus of 1,014,000 words drawn from a broad range of literature in English. We start with a list of tokens occurring more than 300 times in the Brown corpus. From this list of 278 words, 32 are culled on the grounds that they are too important as potential index terms. Twenty-six words are then added to the list in the belief that they may occur very frequently in certain kinds of literature. Finally, 149 words are added to the list because the finite state machine based filter in which this list is intended to be used is able to filter them at almost no cost. The final product is a list of 421 stop words that should be maximally efficient and effective in filtering the most frequently occurring and semantically neutral words in general literature in English.^[6]
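The procedure in the quoted abstract (take tokens above a frequency threshold, cull some, then add others) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, its parameters, and the toy corpus are hypothetical, and Fox's actual culled and added words are not reproduced here. The counts in the abstract are consistent with this shape of computation: 278 − 32 + 26 + 149 = 421.

```python
from collections import Counter


def build_stop_list(tokens, min_count, culled=frozenset(), added=frozenset()):
    """Build a frequency-based stop list.

    Keeps tokens occurring more than min_count times, drops the words in
    `culled` (judged too useful as index terms), and adds the words in
    `added` (included for reasons other than corpus frequency).
    """
    counts = Counter(tokens)
    frequent = {word for word, count in counts.items() if count > min_count}
    return (frequent - set(culled)) | set(added)


# Toy corpus standing in for a real one such as the Brown Corpus.
corpus = "the cat saw the dog and the dog saw the cat".split()
stop_list = build_stop_list(corpus, min_count=1, culled={"cat"}, added={"of"})
print(sorted(stop_list))  # ['dog', 'of', 'saw', 'the']
```

With a real corpus the threshold would be far higher (the abstract uses 300 occurrences), and the culled and added sets would come from editorial judgment rather than from the counts themselves.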

In SEO terminology, stop words are the most common words, which many search engines formerly skipped in order to save space and time when processing large volumes of data during crawling or indexing.

For some

lexical words, such as "want"—from a query in order to improve performance.^[7]

In recent years the SEO best practices around stop words have evolved along with the fields of

To be or not to be' just is a collection of stop words, but stop words alone don't do it any justice."^[8]^[9]

References

[1] ISBN 9781139058452.

[2] Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2008). Introduction to Information Retrieval. Cambridge University Press. p. 27.

[3] Weinberg, Bella Hass (2004). "Predecessors of scientific indexing structures in the domain of religion" (PDF). Second Conference on the History and Heritage of Scientific and Technical Information Systems: 126–134. Archived from the original (PDF) on 3 Jan 2016. Retrieved 17 February 2016.

[4] doi:10.1002/asi.5090110403.

[5] doi:10.1002/(SICI)1097-4571(1999)50:12<1066::AID-ASI5>3.0.CO;2-A.

[6] S2CID 20240000.

[7] Stackoverflow: "One of our major performance optimizations for the "related questions" query is removing the top 10,000 most common English dictionary words (as determined by Google search) before submitting the query to the SQL Server 2008 full text engine. It’s shocking how little is left of most posts once you remove the top 10k English dictionary words. This helps limit and narrow the returned results, which makes the query dramatically faster".

[8] "Google: Stop Worrying About Stop Words Just Write Naturally". seroundtable.com. Retrieved 2022-07-15.

[9] Mueller, John (Feb 6, 2021). "John Mueller on stop words in 2021: "I wouldn't worry about stop words at all"". Twitter. Retrieved July 15, 2022.

External links

Full-Text Stopwords in MySQL

English Stop Words (CSV)

Stop Words Indonesia Query PHP Array

German Stop Words, German Stop Words and phrases, another list of German stop words

Polish Stop Words

Collection of stop words in 29 languages (archive)

List of Hindi Stop Words
