Trie
Trie | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Type | Tree | |||||||||||||||||||||||
Invented | 1960 | |||||||||||||||||||||||
Invented by | Edward Fredkin, Axel Thue, and René de la Briandais | |||||||||||||||||||||||
|
In computer science, a trie (/ˈtraɪ/, /ˈtriː/), also called digital tree or prefix tree,[1] is a type of k-ary search tree, a tree data structure used for locating specific keys from within a set. These keys are most often strings, with links between nodes defined not by the entire key, but by individual characters. In order to access a key (to recover its value, change it, or remove it), the trie is traversed depth-first, following the links between nodes, which represent each character in the key.
Unlike a binary search tree, nodes in the trie do not store their associated key. Instead, a node's position in the trie defines the key with which it is associated. This distributes the value of each key across the data structure, and means that not every node necessarily has an associated value.
All the children of a node have a common prefix of the string associated with that parent node, and the root is associated with the empty string. This task of storing data accessible by its prefix can be accomplished in a memory-optimized way by employing a radix tree.
Though tries can be keyed by character strings, they need not be. The same algorithms can be adapted for ordered lists of any underlying type, e.g. permutations of digits or shapes. In particular, a bitwise trie is keyed on the individual bits making up a piece of fixed-length binary data, such as an integer or memory address. The key lookup complexity of a trie remains proportional to the key size. Specialized trie implementations such as compressed tries are used to deal with the enormous space requirement of a trie in naive implementations.
History, etymology, and pronunciation
The idea of a trie for representing a set of strings was first abstractly described by Axel Thue in 1912.[2][3] Tries were first described in a computer context by René de la Briandais in 1959.[4][3][5]: 336
The idea was independently described in 1960 by Edward Fredkin,[6] who coined the term trie, pronouncing it /ˈtriː/ (as "tree"), after the middle syllable of retrieval.[7][8] However, other authors pronounce it /ˈtraɪ/ (as "try"), in an attempt to distinguish it verbally from "tree".[7][8][3]
Overview
Tries are a form of string-indexed look-up data structure, which is used to store a dictionary list of words that can be searched on in a manner that allows for efficient generation of
Tries can be efficacious on
Operations
Tries support various operations: insertion, deletion, and lookup of a string key. Tries are composed of nodes that contain links, which either point to other suffix child nodes or null. As for every tree, each node but the root is pointed to by only one other node, called its parent. Each node contains as many links as the number of characters in the applicable alphabet (although tries tend to have a substantial number of null links). In some cases, the alphabet used is simply that of the character encoding—resulting in, for example, a size of 256 in the case of (unsigned) ASCII.[14]: 732
The null links within the children of a node emphasize the following characteristics:[14]: 734 [5]: 336
- Characters and string keys are implicitly stored in the trie, and include a character sentinel value indicating string termination.
- Each node contains one possible link to a prefix of strong keys of the set.
A basic structure type of nodes in the trie is as follows; may contain an optional , which is associated with each key stored in the last character of string, or terminal node.
structure Node Children Node[Alphabet-Size] Is-Terminal Boolean Value Data-Type end structure |
Searching
Searching for a value in a trie is guided by the characters in the search string key, as each node in the trie contains a corresponding link to each possible character in the given string. Thus, following the string within the trie yields the associated value for the given string key. A null link during the search indicates the inexistence of the key.[14]: 732-733
The following pseudocode implements the search procedure for a given string key in a rooted trie x.[15]: 135
Trie-Find(x, key) for 0 ≤ i < key.length do if x.Children[key[i]] = nil then return false end if x := x.Children[key[i]] repeat return x.Value |
In the above pseudocode, x and key correspond to the pointer of trie's root node and the string key respectively. The search operation, in a standard trie, takes time, where is the size of the string parameter , and corresponds to the
The trie occupies less space in comparison with a BST in the case of a large number of short strings, since nodes share common initial string subsequences and store the keys implicitly.[12]: 358 The terminal node of the tree contains a non-null value, and it is a search hit if the associated value is found in the trie, and search miss if it is not.[14]: 733
Insertion
Insertion into trie is guided by using the
1 2 3 4 5 6 7 8 9 |
Trie-Insert(x, key, value) for 0 ≤ i < key.length do if x.Children[key[i]] = nil then x.Children[key[i]] := Node() end if x := x.Children[key[i]] repeat x.Value := value x.Is-Terminal := True |
If a null link is encountered prior to reaching the last character of the string key, a new node is created (line 3).[14]: 745 The value of the terminal node is assigned to the input value; therefore, if the former was non-null at the time of insertion, it is substituted with the new value.
Deletion
Deletion of a
The following is a recursive procedure for removing a string key from rooted trie (x).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
Trie-Delete(x, key) if key = nil then if x.Is-Terminal = True then x.Is-Terminal := False x.Value := nil end if for 0 ≤ i < x.Children.length if x.Children[i] != nil return x end if repeat return nil end if x.Children[key[0]] := Trie-Delete(x.Children[key[0]], key[1:]) return x |
The procedure begins by examining the key; null denotes the arrival of a terminal node or end of a string key. If the node is terminal it has no children, it is removed from the trie (line 14). However, an end of string key without the node being terminal indicates that the key does not exist, thus the procedure does not modify the trie. The recursion proceeds by incrementing key's index.
Replacing other data structures
Replacement for hash tables
A trie can be used to replace a hash table, over which it has the following advantages:[12]: 358
- Searching for a node with an associated key of size has the complexity of , whereas an imperfect hash function may have numerous colliding keys, and the worst-case lookup speed of such a table would be , where denotes the total number of nodes within the table.
- Tries do not need a hash function for the operation, unlike a hash table; there are also no collisions of different keys in a trie.
- Buckets in a trie, which are analogous to hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
- String keys within the trie can be sorted using a predetermined alphabetical ordering.
However, tries are less efficient than a hash table when the data is directly accessed on a
Implementation strategies
Tries can be represented in several ways, corresponding to different trade-offs between memory use and speed of the operations.
Techniques such as alphabet reduction may alleviate the high space complexity by reinterpreting the original string as a long string over a smaller alphabet i.e. a string of n bytes can alternatively be regarded as a string of 2n four-bit units and stored in a trie with sixteen pointers per node. However, lookups need to visit twice as many nodes in the worst-case, although space requirements go down by a factor of eight.[5]: 347–352 Other techniques include storing a vector of 256 ASCII pointers as a bitmap of 256 bits representing ASCII alphabet, which reduces the size of individual nodes dramatically.[18]
Bitwise tries
Bitwise tries are used to address the enormous space requirement for the trie nodes in a naive simple pointer vector implementations. Each character in the string key set is represented via individual bits, which are used to traverse the trie over a string key. The implementations for these types of trie use
__builtin_clz()
intrinsic function). Accordingly, the set bit is used to index the first item, or child node, in the 32- or 64-entry based bitwise tree. Search then proceeds by testing each subsequent bit in the key.[19]This procedure is also
Compressed tries
Radix tree, also known as a compressed trie, is a space-optimized variant of a trie in which nodes with only one child get merged with its parents; elimination of branches of the nodes with a single child results in better in both space and time metrics.[20][21]: 452 This works best when the trie remains static and set of keys stored are very sparse within their representation space.[22]: 3–16
One more approach is to "pack" the trie, in which a space-efficient implementation of a sparse packed trie applied to automatic
Patricia trees
Patricia trees are a particular implementation of the compressed binary trie that uses the binary encoding of the string keys in its representation.[23][15]: 140 Every node in a Patricia tree contains an index, known as a "skip number", that stores the node's branching index to avoid empty subtrees during traversal.[15]: 140-141 A naive implementation of a trie consumes immense storage due to larger number of leaf-nodes caused by sparse distribution of keys; Patricia trees can be efficient for such cases.[15]: 142 [24]: 3
A representation of a Patricia tree is shown to the right. Each index value adjacent to the nodes represents the "skip number"—the index of the bit with which branching is to be decided.[24]: 3 The skip number 1 at node 0 corresponds to the position 1 in the binary encoded ASCII where the leftmost bit differed in the key set .
Applications
Trie data structures are commonly used in
Sorting
Lexicographic sorting of a set of string keys can be implemented by building a trie for the given keys and traversing the tree in pre-order fashion;[26] this is also a form of radix sort.[27] Tries are also fundamental data structures for burstsort, which is notable for being the fastest string sorting algorithm as of 2007,[28] accompanied for its efficient use of CPU cache.[29]
Full-text search
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text to carry out fast full-text searches.[30]
Web search engines
A specialized kind of trie called a compressed trie, is used in
Bioinformatics
Tries are used in Bioinformatics, notably in sequence alignment software applications such as BLAST, which indexes all the different substring of length k (called k-mers) of a text by storing the positions of their occurrences in a compressed trie sequence databases.[25]: 75
Internet routing
Compressed variants of tries, such as databases for managing
See also
References
- ^ a b Maabar, Maha (17 November 2014). "Trie Data Structure". CVR, University of Glasgow. Archived from the original on 27 January 2021. Retrieved 17 April 2022.
- ^ Thue, Axel (1912). "Über die gegenseitige Lage gleicher Teile gewisser Zeichenreihen". Skrifter Udgivne Af Videnskabs-Selskabet I Christiania. 1912 (1): 1–67. Cited by Knuth.
- ^ ISBN 0-201-89685-0.
- S2CID 10963780. Archived from the original(PDF) on 2020-02-11. Cited by Brass and by Knuth.
- ^ ISBN 978-0521880374.
- ^ S2CID 15384533.
- ^ a b Black, Paul E. (2009-11-16). "trie". Dictionary of Algorithms and Data Structures. National Institute of Standards and Technology. Archived from the original on 2011-04-29.
- ^ a b c d e Franklin Mark Liang (1983). Word Hy-phen-a-tion By Com-put-er (PDF) (Doctor of Philosophy thesis). Stanford University. Archived (PDF) from the original on 2005-11-11. Retrieved 2010-03-28.
- ^ "Trie". School of Arts and Science, Rutgers University. 2022. Archived from the original on 17 April 2022. Retrieved 17 April 2022.
- S2CID 18747244.
- ^ S2CID 207735784.
- ^ ISBN 9780198099307.
- ISBN 978-3-540-40391-3.
- ^ ISBN 978-0321573513.
- ^ ISBN 978-0-201-41607-7.
- ISBN 9780198066231.
- ^ S. Orley; J. Mathews. "The IEEE 754 Format". Department of Mathematics and Computer Science, Emory University. Archived from the original on 28 March 2022. Retrieved 17 April 2022.
- S2CID 12943246.
- ^ .
- ^ Sartaj Sahni (2004). "Data Structures, Algorithms, & Applications in C++: Tries". University of Florida. Archived from the original on 3 July 2016. Retrieved 17 April 2022.
- ISBN 978-1498701853.
- .
- ^ "Patricia tree". National Institute of Standards and Technology. Archived from the original on 14 February 2022. Retrieved 17 April 2022.
- ^ ISBN 978-0-387-49616-0 – via HAL (open archive).
- ^ ISSN 0306-4379.
- ^ Kärkkäinen, Juha. "Lecture 2" (PDF). University of Helsinki.
The preorder of the nodes in a trie is the same as the lexicographical order of the strings they represent assuming the children of a node are ordered by the edge labels.
- ^ Kallis, Rafael (2018). "The Adaptive Radix Tree (Report #14-708-887)" (PDF). University of Zurich: Department of Informatics, Research Publications.
- S2CID 3184411.
- ISBN 978-3-540-89096-6.
- ISSN 0097-5397.
- S2CID 37884057.
- S2CID 932749.