Linguistic categories
Linguistic categories include:
- Lexical category, a part of speech such as noun, preposition, etc.
- Syntactic category, a similar concept which can also include phrasal categories
- Grammatical category, a grammatical feature such as tense, gender, etc.
The definition of linguistic categories is a major concern of
Linguistic category inventories
To facilitate the
Part-of-Speech tagsets
Schools commonly teach that there are 9
In part-of-speech tagging by computer, it is typical to distinguish from 50 to 150 separate parts of speech for English. POS tagging work has been done in a variety of languages, and the set of POS tags used varies greatly with language. Tags usually are designed to include overt morphological distinctions, although this leads to inconsistencies such as case-marking for pronouns but not nouns in English, and much larger cross-language differences. The tag sets for heavily inflected languages such as
The most popular "tag set" for POS tagging for American English is probably the Penn tag set, developed in the Penn Treebank project.
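The gap between a handful of school-grammar classes and a fine-grained tagset can be illustrated with a short sketch. It assumes the Penn convention that related tags share a prefix (NN, NNS, NNP, NNPS for nouns; VB, VBD, VBG, VBN, VBP, VBZ for verbs; JJ, JJR, JJS for adjectives; RB, RBR, RBS for adverbs) and collapses them into coarse classes. The function name `coarse_class`, the partial prefix table and the hand-tagged sentence are illustrative assumptions, not part of any tagger.

```python
# Illustrative sketch: collapse fine-grained Penn-style tags into coarse classes.
# The prefix table is deliberately partial; tags it does not cover map to "other".
COARSE_BY_PREFIX = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
}


def coarse_class(tag: str) -> str:
    """Return a coarse class for a Penn-style tag based on its first two letters."""
    return COARSE_BY_PREFIX.get(tag[:2], "other")


# Hand-tagged example sentence (assumed tagging, not tagger output).
tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD"), ("loudly", "RB")]
coarse = [(token, coarse_class(tag)) for token, tag in tagged]
print(coarse)
```

Going in this direction loses information (number, tense, degree), which is one reason tagsets of very different granularity are hard to reconcile.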
Multilingual annotation schemes
For Western European languages, cross-linguistically applicable annotation schemes for parts-of-speech, morphosyntax and syntax have been developed with the EAGLES Guidelines. The "Expert Advisory Group on Language Engineering Standards" (EAGLES) was an initiative of the
- Large-scale language resources (such as text corpora, computational lexicons and speech corpora);
- Means of manipulating such knowledge, via computational linguistic formalisms, mark up languages and various software tools;
- Means of assessing and evaluating resources, tools and products.
The EAGLES Guidelines have also inspired subsequent work in other regions, e.g., Eastern Europe.[4]
A generation later, a similar effort was initiated by the research community under the umbrella of Universal Dependencies. Petrov et al.[5][6] proposed a "universal" but highly reductionist tag set with 12 categories. It has, for example, no subtypes of nouns, verbs or punctuation, and it does not distinguish "to" as an infinitive marker from "to" as a preposition (hardly a "universal" coincidence). Subsequently, this was complemented with cross-lingual specifications for dependency syntax (Stanford Dependencies)[7] and morphosyntax (Interset interlingua,[8] partially building on the Multext-East/EAGLES tradition) in the context of Universal Dependencies (UD), an international cooperative project to create treebanks of the world's languages with cross-linguistically applicable ("universal") annotations for parts of speech, dependency syntax, and (optionally) morphosyntactic (morphological) features.
Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology. The annotation scheme has its roots in the three related projects named above: the universal part-of-speech tag set, Stanford Dependencies, and the Interset interlingua. The UD annotation scheme represents sentences as dependency trees rather than phrase structure trees. As of February 2019, the UD inventory contained just over 100 treebanks covering more than 70 languages.[9]
The project's primary aim is cross-linguistic consistency of annotation. However, language-specific extensions are permitted for morphological features: individual languages or resources can introduce additional features. In a more restricted form, dependency relations can be extended with a secondary label that accompanies the UD label, e.g., aux:pass for an auxiliary (UD aux) used to mark passive voice.[10]
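The two-level relation labels described above can be handled with a small helper. The sketch below assumes only that a subtype, when present, follows the universal label after a colon, as in aux:pass; the function name `split_relation` is hypothetical.

```python
from typing import Optional, Tuple


def split_relation(label: str) -> Tuple[str, Optional[str]]:
    """Split a dependency label into (universal label, optional subtype)."""
    universal, sep, subtype = label.partition(":")
    return universal, (subtype if sep else None)


print(split_relation("aux:pass"))  # ('aux', 'pass')
print(split_relation("aux"))       # ('aux', None)
```

A tool that only knows the universal inventory can fall back to the first element and ignore the language-specific subtype.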
The Universal Dependencies have inspired similar efforts for the areas of inflectional morphology,
Conventions for interlinear glosses
In linguistics, an interlinear gloss is a gloss (series of brief explanations, such as definitions or pronunciations) placed between lines (inter- + linear), such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more lines of transcription known as an interlinear text or interlinear glossed text (IGT)—interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language. There is no standard inventory for glosses, but common labels are collected in the Leipzig Glossing Rules.[25] Wikipedia also provides a List of glossing abbreviations that draws on this and other sources.
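The line-by-line alignment can be sketched in a few lines of Python. The German sentence, its gloss and the helper `align_gloss` are constructed for illustration; the category labels (1SG, NOM, PRS, DEF, ACC, M) follow common glossing abbreviations, and the sketch assumes a one-to-one, whitespace-separated correspondence between words and glosses.

```python
def align_gloss(source: str, gloss: str) -> str:
    """Pad each word and its gloss to the same width so the columns line up."""
    words, glosses = source.split(), gloss.split()
    if len(words) != len(glosses):
        raise ValueError("source and gloss must have the same number of items")
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    top = "  ".join(w.ljust(n) for w, n in zip(words, widths))
    bottom = "  ".join(g.ljust(n) for g, n in zip(glosses, widths))
    return top.rstrip() + "\n" + bottom.rstrip()


# Constructed example: source line, gloss line, then a free translation.
print(align_gloss("Ich sehe den Hund", "1SG.NOM see.PRS.1SG DEF.ACC.SG.M dog"))
print("'I see the dog.'")
```

Raising an error on mismatched lengths is a design choice: a silent misalignment would attach glosses to the wrong words.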
General Ontology for Linguistic Description (GOLD)
GOLD ("General Ontology for Linguistic Description") is an
GOLD was maintained by the
ISO 12620 (ISO TC37 Data Category Registry, ISOcat)
ISO 12620 is a standard from ISO/TC 37 that defines a Data Category Registry, a registry of linguistic terms used in various fields of translation, computational linguistics and natural language processing. It also defines mappings between different terms, and between uses of the same term in different systems.[28][29][30]
An earlier implementation of this standard, ISOcat, provides persistent identifiers and URIs for linguistic categories, including the inventory of the GOLD ontology (see above). The goal of the registry is that new systems can reuse existing terminology, or at least be easily mapped to existing terminology, to aid interoperability.[31] The standard is used by other standards such as the Lexical Markup Framework (ISO 24613:2008), and a number of terminologies have been added to the registry, including the EAGLES Guidelines, the National Corpus of Polish, and the TermBase eXchange format from the Localization Industry Standards Association.
However, the current edition, ISO 12620:2019,[32] no longer provides a registry of terms for language technology and terminology. It is now restricted to terminology resources, hence the revised title "Management of terminology resources — Data category specifications". Accordingly, ISOcat is no longer actively developed.[33] As of May 2020, the successor systems, CLARIN Concept Registry[34] and DatCatInfo,[35] were only emerging.
For linguistic categories relevant to lexical resources, the lexinfo vocabulary represents an established community standard,[36] in particular in connection with the OntoLex vocabulary and machine-readable dictionaries in the context of Linguistic Linked Open Data technologies. Just as the OntoLex vocabulary builds on the Lexical Markup Framework (LMF), lexinfo builds on (the LMF section of) ISOcat.[37] Unlike ISOcat, however, lexinfo is actively maintained and, as of May 2020, was being extended in a community effort.[38]
Ontologies of Linguistic Annotation (OLiA)
Similar in spirit to GOLD, the Ontologies of Linguistic Annotation (OLiA) provide a reference inventory of linguistic categories for syntactic, morphological and semantic phenomena relevant for
In addition to annotation schemes, the OLiA Reference Model is also linked with the EAGLES Guidelines,[40] GOLD,[40] ISOcat,[41] the CLARIN Concept Registry,[42] Universal Dependencies,[43] and lexinfo,[43] among others, and thus enables interoperability between these vocabularies. OLiA is being developed as a community project on GitHub.[44]
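How linking to a shared reference model yields interoperability can be sketched as follows. Everything in the block is an assumption made for illustration: the table `LINKS`, the reference category names (placeholders, not actual OLiA identifiers) and the choice of the Penn and STTS tagsets as the two schemes being compared.

```python
from typing import Dict, Tuple

SchemeTag = Tuple[str, str]  # (scheme name, tag)

# Hypothetical links from scheme-specific tags to shared reference categories.
# The category names are placeholders, not identifiers from any real ontology.
LINKS: Dict[SchemeTag, str] = {
    ("penn", "NN"): "CommonNoun",
    ("penn", "NNP"): "ProperNoun",
    ("stts", "NN"): "CommonNoun",
    ("stts", "NE"): "ProperNoun",
}


def same_category(a: SchemeTag, b: SchemeTag) -> bool:
    """True if both tags are linked to the same reference category."""
    ref_a, ref_b = LINKS.get(a), LINKS.get(b)
    return ref_a is not None and ref_a == ref_b


print(same_category(("penn", "NNP"), ("stts", "NE")))  # True
print(same_category(("penn", "NNP"), ("stts", "NN")))  # False
```

Because each scheme is linked only to the reference categories, adding a further scheme requires one new set of links rather than a pairwise mapping to every existing scheme.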
References
- ^ John R Taylor (1995) Linguistic Categorization: Prototypes in Linguistic Theory, 2nd ed., ch.2 p.21
- ^ Universal POS tags
- ^ The essentials of EAGLES
- ^ Dimitrova, L., Ide, N., Petkevic, V., Erjavec, T., Kaalep, H. J., & Tufis, D. (1998, August). Multext-east: Parallel and comparable corpora and lexicons for six central and eastern european languages. In Proceedings of the 17th international conference on Computational linguistics-Volume 1 (pp. 315-319). Association for Computational Linguistics.
- ].
- ].
- ^ "Stanford Dependencies". nlp.stanford.edu. The Stanford Natural Language Processing Group. Retrieved 8 May 2020.
- ^ "Interset". cuni.cz. Institute of Formal and Applied Linguistics (Czech Republic). Retrieved 8 May 2020.
- ^ "Universal Dependencies". universaldependencies.org. Retrieved 2020-05-14.
- ^ "aux:pass". universaldependencies.org. Retrieved 2020-05-14.
- ^ UniMorph. "UniMorph: Universal Morphological Annotation". UniMorph. Retrieved 2020-05-14.
- ^ System-T/UniversalPropositions, System-T, 2020-05-14, retrieved 2020-05-14
- ^ Prange, J., Schneider, N., & Abend, O. (2019, August). Semantically Constrained Multilayer Annotation: The Case of Coreference. In Proceedings of the First International Workshop on Designing Meaning Representations (pp. 164-176).
- ^ "Penn Parsed Corpora of Historical English: Other Corpora". www.ling.upenn.edu. Retrieved 2020-05-14.
- ^ "Icelandic Parsed Historical Corpus (IcePaHC)". www.linguist.is. Retrieved 2020-05-14.
- ^ Taylor, Ann; Warner, Anthony; Pintzuk, Susan; Beths, Frank (September 2003). "The York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE)".
- ^ "Penn-Helsinki Parsed Corpus of Middle English 2". www.ling.upenn.edu. Retrieved 2020-05-14.
- ^ "Corpus of Historical Low German". www.chlg.ac.uk. Retrieved 2020-05-14.
- ^ Light, C., & Wallenberg, J. (2011). On the use of passives across Germanic. Presented at 13th Meeting of the Diachronic Generative Syntax (DIGS) Conference DIGS 13, University of Pennsylvania. June 5, 2011
- ^ Beatrice Santorini (1993). "The rate of phrase structure change in the history of Yiddish" (ftp://babel.ling.upenn.edu/papers/faculty/beatrice%20santorini/santorini-1993.pdf). Language Variation and Change 5, 257-283.
- ^ "Tycho Brahe Project". www.tycho.iel.unicamp.br. Retrieved 2020-05-14.
- ^ "NPCMJ – Ninjal Parsed Corpus of Modern Japanese". Retrieved 2020-05-14.
- ^ "Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) - Linguistic Data Consortium". catalog.ldc.upenn.edu. Retrieved 2020-05-14.
- ^ "Penn Chinese Treebank Project". verbs.colorado.edu. Retrieved 2020-05-14.
- ^ Comrie, B., Haspelmath, M., & Bickel, B. (2008). The Leipzig Glossing Rules: Conventions for interlinear morpheme-by-morpheme glosses. Department of Linguistics of the Max Planck Institute for Evolutionary Anthropology & the Department of Linguistics of the University of Leipzig. Retrieved January 28, 2010.
- ^ Scott Farrar and D. Terence Langendoen (2003) "A linguistic ontology for the Semantic Web." GLOT International. 7 (3), pp. 97-100.
- ^ GOLD versions
- ^ "ISO 12620:1999 - Computer applications in terminology -- Data categories". iso.org. 2011. Retrieved 9 November 2011.
- ^ "ISO 12620:2009 - Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources". iso.org. 2011. Retrieved 9 November 2011.
- ^ "ISO 12620:2019 Management of terminology resources — Data category specifications". ISO. Retrieved 20 January 2020.
- doi:10.7202/002101ar.
- ^ "ISO 12620:2019 Management of terminology resources — Data category specifications". ISO. Retrieved 20 January 2020.
- ^ "The Data Category Repository (DCR) has changed address". www.iso.org. Retrieved 2020-05-08.
- ^ "CLARIN Concept Registry | CLARIN ERIC". www.clarin.eu. Retrieved 2020-05-08.
- ^ "DatCatInfo". www.datcatinfo.net. Retrieved 2020-05-08.
- ^ "LexInfo". www.lexinfo.net. Retrieved 2020-05-14.
- ^ a b Cimiano, P., Chiarcos, C., McCrae, J. P., & Gracia, J. (2020). Linguistic Linked Data (pp. 137-160). Springer, Cham.
- ^ ontolex/lexinfo, OntoLex Community Group, 2020-03-07, retrieved 2020-05-14
- ^ "OLiA ontologies". purl.org/olia. Retrieved 2020-05-14.
- ^ a b Chiarcos, C. (2008). An ontology of linguistic annotations. In LDV Forum (Vol. 23, No. 1, pp. 1-16).
- ^ Chiarcos, C. (2010, May). Grounding an ontology of linguistic annotations in the Data Category Registry. In LREC 2010 Workshop on Language Resource and Language Technology Standards (LT<S), Valetta, Malta (pp. 37-40).
- arXiv:2004.08355.
- ^ a b Christian Chiarcos, Maxim Ionov and Christian Fäth (2020), Annotation interoperability in the post-ISOcat era, LREC 2020
- ^ acoli-repo/olia, ACoLi, 2020-03-10, retrieved 2020-05-14