Sentence boundary disambiguation
Sentence boundary disambiguation (SBD), also known as sentence breaking, sentence boundary detection, and sentence segmentation, is the problem in natural language processing of deciding where sentences begin and end. Some languages, including Japanese and Chinese, have unambiguous sentence-ending markers.
Strategies
The standard 'vanilla' approach to locating the end of a sentence:
- (a) If it is a period, it ends a sentence.
- (b) If the preceding token is in the hand-compiled list of abbreviations, then it does not end a sentence.
- (c) If the next token is capitalized, then it ends a sentence.
This strategy gets about 95% of sentences correct. Idiosyncratic spellings (for example, a title such as ".hack//SIGN") and usage of non-standard punctuation (or non-standard usage of punctuation) in a text often fall under the remaining 5%.
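The three rules above can be sketched in a few lines of Python. This is only one possible reading of them: it assumes the rules are applied in order, with each later rule overriding the earlier ones, and that tokens are separated by whitespace. The abbreviation list and the function name `split_sentences` are hypothetical and chosen for illustration.

```python
# Hypothetical hand-compiled abbreviation list (lower-cased, with trailing period).
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "a.m.", "p.m.", "etc."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split text into sentences using rules (a)-(c), applied in order."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        ends = True                              # (a) a period ends a sentence
        if token.lower() in abbreviations:
            ends = False                         # (b) unless the token is an abbreviation
        if next_token is not None and next_token[0].isupper():
            ends = True                          # (c) a capitalized next token ends it anyway
        if ends:
            sentences.append(" ".join(current))
            current = []
    if current:                                  # flush any trailing tokens
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("We met at 5 p.m. today. It went well."))
print(split_sentences("Dr. Smith arrived."))
```

Under this reading, the first call keeps "p.m." inside its sentence, while the second call wrongly breaks after "Dr." because rule (c) overrides rule (b). Errors of this kind are part of what the remaining share of misclassified sentences consists of.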
Another approach is to automatically learn a set of rules from a set of documents where the sentence breaks are pre-marked. Solutions have been based on a maximum entropy model. The SATZ architecture uses a neural network to disambiguate sentence boundaries and achieves 98.5% accuracy.
Software
- Examples of use of Perl compatible regular expressions ("PCRE")
  - ((?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]\"))(\s|\r\n)(?=\"?[A-Z])
  - $sentences = preg_split("/(?<!\..)([\?\!\.]+)\s(?!.\.)/", $text, -1, PREG_SPLIT_DELIM_CAPTURE); (for PHP)
- Online use, libraries, and APIs
- Toolkits that include sentence detection
  - Apache OpenNLP[8]
  - Freeling (software)[9]
  - Natural Language Toolkit[10]
  - Stanford NLP[11]
  - GExp[12]
  - CogComp-NLP[13]
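As an illustration of the first regular expression listed above, the following sketch adapts it to Python's `re` module. The capturing groups are rewritten as non-capturing groups so that `re.split` does not return the matched separators; the names `BOUNDARY` and `regex_split` are hypothetical. It is a minimal sketch, not a robust segmenter.

```python
import re

# Adaptation of the first pattern above: a boundary is whitespace that follows
# [a-z0-9] plus one of . ? ! (optionally followed by a closing double quote)
# and that precedes a capital letter (optionally after an opening double quote).
BOUNDARY = re.compile(
    r'(?:(?<=[a-z0-9][.?!])|(?<=[a-z0-9][.?!]"))(?:\s|\r\n)(?="?[A-Z])'
)


def regex_split(text):
    """Split text at every position matched by BOUNDARY."""
    return BOUNDARY.split(text)


print(regex_split('He left. She stayed! "Why?" Nobody knew.'))
print(regex_split("Dr. Smith left."))
```

The pattern has no notion of abbreviations, so the second call also breaks after "Dr.".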
See also
- Multiword expression
- Punctuation
- Sentence extraction
- Sentence spacing
- Speech segmentation
- Syllabification
- Text segmentation
- Translation memory
- Word divider
References
1. E. Stamatatos; N. Fakotakis & G. Kokkinakis. "Automatic extraction of rules for sentence boundary disambiguation". Proceedings of the Workshop on Machine Learning in Human Language Technology. University of Patras. pp. 88–92.
2. O'Neil, John. "Doing Things with Words, Part Two: Sentence Boundary Detection". Archived from the original on 2009-02-21. Retrieved 2009-01-03.
3. Reynar, JC; Ratnaparkhi, A. "A Maximum Entropy Approach to Identifying Sentence Boundaries" (PDF). Retrieved 2009-01-03.
4. "SATZ: An Adaptive Sentence Boundary Detector". Archived from the original on 2007-09-22.
5. "SentParBreaker Web page". Archived from the original on 2007-11-12.
6. "Lingua-EN-Sentence-0.25 - Module for splitting text into sentences. - metacpan.org". metacpan.org.
7. "Text::Sentence - module for splitting text into sentences - metacpan.org". metacpan.org.
8. "Apache OpenNLP". opennlp.apache.org.
9. "Welcome | FreeLing Home Page".
10. "NLTK :: Natural Language Toolkit". www.nltk.org.
11. "Software - The Stanford Natural Language Processing Group". nlp.stanford.edu.
12. "Google Code Archive - Long-term storage for Google Code Project Hosting". code.google.com.
13. "CogCompNLP". January 2, 2024 – via GitHub.