Word list
A word list (or lexicon) is a list of a language's
In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.
Type | Occurrences | Rank |
---|---|---|
the | 3,789,654 | 1st |
he | 2,098,762 | 2nd |
[...] | ||
king | 57,897 | 1,356th |
boy | 56,975 | 1,357th |
[...] | ||
stringyfy | 5 | 34,589th |
[...] | ||
transducionalify | 1 | 123,567th |
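As a rough illustration of how such a table can be derived, the following Python sketch counts word types in a text and assigns ranks by descending frequency. The function name `frequency_list`, the sample sentence and the tokenization rule (lowercased runs of letters and apostrophes) are assumptions made for this example; real frequency lists depend heavily on how the corpus and the word unit are defined.

```python
from collections import Counter
import re


def frequency_list(text):
    """Return (word type, occurrences, rank) rows, most frequent first."""
    # Naive tokenization (an assumption of this sketch): lowercase the text
    # and keep runs of ASCII letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count; ties are broken alphabetically so that
    # the resulting order is deterministic.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    # The rank is simply the 1-based position in the sorted list.
    return [(word, count, rank)
            for rank, (word, count) in enumerate(ordered, start=1)]


sample = "The king and the boy saw the king."
for word, count, rank in frequency_list(sample):
    print(f"{word} | {count} | {rank}")
```

Tied counts receive consecutive ranks here; a list that gives tied items the same rank would need a different ranking rule.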
Methodology
Factors
Nation (1997) noted that computing capabilities have made corpus analysis much easier. He cited several key issues that influence the construction of frequency lists:
- corpus representativeness
- word frequency and range
- treatment of word families
- treatment of idioms and fixed expressions
- range of information
- various other criteria
Corpora
Traditional written corpus
Most currently available studies are based on written text corpora, which are more readily available and easier to process.
SUBTLEX movement
However, New et al. 2007 proposed tapping into the large number of subtitles available online in order to analyse large amounts of speech. Brysbaert & New 2009 made a long critical evaluation of the traditional textual analysis approach and supported a move toward speech analysis and the analysis of film subtitles available online. This has been followed by a handful of follow-up studies,[1] providing valuable frequency counts for various languages. Indeed, in five years the SUBTLEX movement completed full studies for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al. 2010), Vietnamese (Pham, Bolger & Baayen 2011), Brazilian Portuguese (Tang 2012) and European Portuguese (Soares et al. 2015), Albanian (Avdyli & Cuetos 2013), Polish (Mandera et al. 2014) and Catalan (2019[2]). SUBTLEX-IT (2015) provides raw data only.[3]
Lexical unit
In any case, the basic "word" unit should be defined. For Latin scripts, words are usually one or several characters separated either by spaces or punctuation. But exceptions can arise, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group words of a
Statistics
It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
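Zipf's law states that the frequency of an item is roughly inversely proportional to its rank, so plotting the logarithm of frequency against the logarithm of rank should give a line with a slope near -1. The following Python sketch estimates that slope by ordinary least squares. The function name `zipf_slope` and the synthetic input are assumptions made for illustration; the input is idealised data, not counts from a real corpus.

```python
import math


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two strictly positive counts.
    """
    # Rank 1 goes to the largest count, rank 2 to the next, and so on.
    ordered = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(freq) for freq in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Slope of the best-fit line: covariance(x, y) / variance(x).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Synthetic, perfectly Zipfian counts: frequency = constant / rank.
ideal = [1_000_000 / rank for rank in range(1, 1001)]
print(zipf_slope(ideal))  # very close to -1 for this idealised input
```

On real frequency lists the fitted slope typically deviates from -1, especially at the very top and in the long tail of rare words, so the value is best read as a rough diagnostic.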
German linguists define the Häufigkeitsklasse (frequency class) $N$ of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. For example, the misspelled word outragious, with a ratio of 76/3789654, belongs in class 16.
$$N = \left\lfloor 0.5 - \log_2\left(\frac{\text{frequency of this item}}{\text{frequency of the most common item}}\right)\right\rfloor$$
where $\lfloor \ldots \rfloor$ is the floor function.
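The frequency class can be computed directly from two counts. The Python sketch below assumes that the class is obtained by rounding the negated base 2 logarithm of the ratio to the nearest integer, written as floor(0.5 - log2(ratio)); this convention agrees with the class 0, class 1 and class 16 examples given in the text. The function name `frequency_class` is hypothetical.

```python
import math


def frequency_class(freq, max_freq):
    """Frequency class of an item, given its count and the top count.

    Assumed convention: floor(0.5 - log2(freq / max_freq)).
    """
    return math.floor(0.5 - math.log2(freq / max_freq))


top = 3_789_654  # count of the most frequent item in the example
print(frequency_class(top, top))       # the most frequent item itself
print(frequency_class(top // 2, top))  # an item half as frequent
print(frequency_class(76, top))        # the "outragious" example
```

Because the scale is logarithmic, each step up in class corresponds to roughly halving the frequency relative to the most common item.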
Frequency lists, together with
Pedagogy
These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006).
Effects of word frequency
Word frequency is known to have various effects (
Languages
Below is a review of available resources.
English
Word counting is an ancient field,
- The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)
The Teacher's Word Book contains 30,000 lemmas or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age, and language changes, have reduced its applicability (Nation 1997).
- The General Service List (West, 1953)
The General Service List contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rate of occurrence (%) for the different meanings and parts of speech of each headword is provided. Various criteria, other than frequency and range, were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise (Nation 1997). This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.
- The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)
A corpus of 5 million running words, from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials, and in its tagging of each word with its frequency in each school grade and in each subject area (Nation 1997).
- The Brown (Francis and Kucera, 1982), LOB and related corpora
These now contain 1 million words from a written corpus representing different dialects of English. These sources are used to produce frequency lists (Nation 1997).
French
- Traditional datasets
A review has been made by New & Pallier. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules.[11] It is claimed that 70 grammatical words constitute 50% of a communicative sentence,[12][13] while 3,680 words provide about 95–98% coverage.[14] A list of 3,000 frequent words is available.[15]
The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet.[16] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".[17]
More recently, the project
- Subtlex
Lexique3 is an ongoing study from which the SUBTLEX movement cited above originated. New et al. 2007 made a completely new count based on online film subtitles.
Spanish
There have been several studies of Spanish word frequency (Cuetos et al. 2011).[19]
Chinese
Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency (
Other
Lists of the most frequently used words in various other languages, based on Wikipedia or combined corpora, are also available.[20]
See also
- Letter frequency
- Most common words in English
- Long tail
- Google Ngram Viewer – shows changes in word/phrase frequency (and relative frequency) over time
Notes
- ^ "Crr » Subtitle Word Frequencies".
- S2CID 84843788.
- ^ Amenta, Simona; Mandera, Paweł; Keuleers, Emmanuel; Brysbaert, Marc; Crepaldi, Davide (7 January 2022). "SUBTLEX-IT".
- ISSN 0270-2711.
- ^ "APA PsycNet". psycnet.apa.org. Retrieved 2023-05-15.
- ^ "Words and phrases: Frequency, genres, collocates, concordances, synonyms, and WordNet".
- ^ "Corpus of Contemporary American English (COCA)".
- ^ "It's the links, stupid". The Economist. 20 April 2006. Retrieved 2008-06-05.
- ^ Merholz, Peter (1999). "Peterme.com". Internet Archive. Archived from the original on 1999-10-13. Retrieved 2008-06-05.
- ^ Kottke, Jason (26 August 2003). "kottke.org". Retrieved 2008-06-05.
- ^ "Le français fondamental". Archived from the original on 2010-07-04.
- ^ Ouzoulias, André (2004), Comprendre et aider les enfants en difficulté scolaire: Le Vocabulaire fondamental, 70 mots essentiels (PDF), Retz - Citing V.A.C Henmon (dead link, no Internet Archive copy, 10 August 2023)
- ^ Liste des "70 mots essentiels" recensés par V.A.C. Henmon
- ^ "Generalities".
- ^ "PDF 3000 French words".
- ^ "Maitrise de la langue à l'école: Vocabulaire". Ministère de l'éducation nationale.
- ISBN 978-2-7606-1563-2
- ^ "Lexique".
- ^ "Spanish word frequency lists". Vocabularywiki.pbworks.com.
- ^ Most frequently used words in different languages, ezglot
References
Theoretical concepts
- Nation, P. (1997), "Vocabulary size, text coverage, and word lists", in Schmitt; McCarthy (eds.), Vocabulary: Description, Acquisition and Pedagogy, Cambridge: Cambridge University Press, pp. 6–19, ISBN 978-0-521-58551-4
- Laufer, B. (1997), "What's in a word that makes it hard or easy? Some intralexical factors that affect the learning of words.", Vocabulary: Description, Acquisition and Pedagogy, Cambridge: Cambridge University Press, pp. 140–155, ISBN 9780521585514
- ISBN 9780080448541.
- Brysbaert, Marc; Buchmeier, Matthias; Conrad, Markus; Jacobs, Arthur M.; Bölte, Jens; Böhl, Andrea (2011). "The word frequency effect: a review of recent developments and implications for the choice of frequency estimates in German". Experimental Psychology. 58 (5): 412–424.
- Rudell, A.P. (1993), "Frequency of word usage and perceived word difficulty : Ratings of Kucera and Francis words", Most, vol. 25, pp. 455–463
- Segui, J.; Mehler, Jacques; Frauenfelder, Uli; Morton, John (1982), "The word frequency effect and lexical access", Neuropsychologia, 20 (6): 615–627, S2CID 39694258
- Meier, Helmut (1967), Deutsche Sprachstatistik, Hildesheim: Olms (frequency list of German words)
- DeFrancis, John (1966), Why Johnny can't read Chinese
- Allanic, Bernard (2003), The corpus of characters and their pedagogical aspect in ancient and contemporary China (fr: Les corpus de caractères et leur dimension pédagogique dans la Chine ancienne et contemporaine) (doctoral thesis), Paris: INALCO
Written texts-based databases
- Da, Jun (1998), Jun Da: Chinese text computing, retrieved 2010-08-21.
- Taiwan Ministry of Education (1997), 八十六年常用語詞調查報告書, retrieved 2010-08-21.
- New, Boris; Pallier, Christophe, Manuel de Lexique 3 (in French) (3.01 ed.).
- Gimenes, Manuel; New, Boris (2016), "Worldlex: Twitter and blog word frequencies for 66 languages", Behavior Research Methods, 48 (3): 963–972, PMID 26170053.
SUBTLEX movement
- New, B.; Brysbaert, M.; Veronis, J.; Pallier, C. (2007). "SUBTLEX-FR: The use of film subtitles to estimate word frequencies" (PDF). Applied Psycholinguistics. 28 (4): 661. S2CID 145366468. Archived from the original (PDF) on 2016-10-24.
- Brysbaert, Marc; New, Boris (2009), "Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English" (PDF), Behavior Research Methods, 41 (4): 977–990, S2CID 4792474
- Keuleers, E, M, B.; New, B. (2010), "SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles", Behavior Research Methods, 42 (3): 643–650
- Cai, Q.; Brysbaert, M. (2010), "SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles", PLOS ONE, 5 (6): 8, PMID 20532192
- Cuetos, F.; Glez-nosti, Maria; Barbón, Analía; Brysbaert, Marc (2011), "SUBTLEX-ESP : Spanish word frequencies based on film subtitles" (PDF), Psicológica, 32: 133–143
- Dimitropoulou, M.; Duñabeitia, Jon Andoni; Avilés, Alberto; Corral, José; Carreiras, Manuel (2010), "SUBTLEX-GR: Subtitle-Based Word Frequencies as the Best Estimate of Reading Behavior: The Case of Greek", Frontiers in Psychology, 1 (December): 12, PMID 21833273
- Pham, H.; Bolger, P.; Baayen, R.H. (2011), "SUBTLEX-VIE : A Measure for Vietnamese Word and Character Frequencies on Film Subtitles", ACOL
- Brysbaert, M.; New, Boris; Keuleers, E. (2012), "SUBTLEX-US : Adding Part of Speech Information to the SUBTLEXus Word Frequencies" (PDF), Behavior Research Methods: 1–22 (databases)
- Mandera, P.; Keuleers, E.; Wodniecka, Z.; Brysbaert, M. (2014). "Subtlex-pl: subtitle-based word frequency estimates for Polish" (PDF). Behav Res Methods. 47 (2): 471–483. S2CID 2334688.
- Tang, K. (2012), "A 61 million word corpus of Brazilian Portuguese film subtitles as a resource for linguistic research", UCL Work Pap Linguist (24): 208–214
- Avdyli, Rrezarta; Cuetos, Fernando (June 2013), "SUBTLEX-AL: Albanian word frequencies based on film subtitles", ILIRIA International Review, 3 (1): 285–292, ISSN 2365-8592
- Soares, Ana Paula; Machado, João; Costa, Ana; Iriarte, Álvaro; Simões, Alberto; de Almeida, José João; Comesaña, Montserrat; Perea, Manuel (April 2015), "On the advantages of word frequency and contextual diversity measures extracted from subtitles: The case of Portuguese", The Quarterly Journal of Experimental Psychology, 68 (4): 680–696, S2CID 5376519