Parsing
Parsing, syntax analysis, or syntactic analysis is the process of analyzing a
The term has slightly different meanings in different branches of linguistics and computer science. Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence or word, sometimes with the aid of devices such as sentence diagrams. It usually emphasizes the importance of grammatical divisions such as subject and predicate.
Within computational linguistics the term is used to refer to the formal analysis by a computer of a sentence or other string of words into its constituents, resulting in a parse tree showing their syntactic relation to each other, which may also contain semantic information.[citation needed] Some parsing algorithms generate a parse forest or list of parse trees from a string that is syntactically ambiguous.[2]
The term is also used in
Within computer science, the term is used in the analysis of
Human languages
Traditional methods
The traditional grammatical exercise of parsing, sometimes known as clause analysis, involves breaking down a text into its component
Parsing was formerly central to the teaching of grammar throughout the English-speaking world, and widely regarded as basic to the use and understanding of written language. However, the general teaching of such techniques is no longer current.[citation needed]
Computational methods
In some machine translation and natural language processing systems, written texts in human languages are parsed by computer programs.[4] Human sentences are not easily parsed by programs, because the structure of human language is substantially ambiguous: language is used to convey meaning (or semantics) from a potentially unlimited range of possibilities, only some of which are germane to the particular case.[5] For example, the utterances "Man bites dog" and "Dog bites man" are each definite about who bites whom, but in another language both might appear as "Man dog bites", relying on the larger context to distinguish between the two possibilities, if indeed that difference is of concern. It is difficult to prepare formal rules to describe informal behaviour even though it is clear that some rules are being followed.[citation needed]
In order to parse natural language data, researchers must first agree on the
Most modern parsers are at least partly
Parsing algorithms for natural language cannot rely on the grammar having 'nice' properties as with manually designed grammars for programming languages. As mentioned earlier some grammar formalisms are very difficult to parse computationally; in general, even if the desired structure is not
Psycholinguistics
In psycholinguistics, parsing involves not just the assignment of words to categories (formation of ontological insights), but the evaluation of the meaning of a sentence according to the rules of syntax drawn by inferences made from each word in the sentence (known as connotation). This normally occurs as words are being heard or read.
Neurolinguistics generally understands parsing to be a function of working memory, meaning that parsing is used to keep several parts of one sentence in play in the mind at one time, all readily accessible to be analyzed as needed. Because human working memory has limitations, so does the function of sentence parsing.[10] This is evidenced by several different types of syntactically complex sentences that pose potential problems for the mental parsing of sentences.
The first, and perhaps most well-known, type of sentence that challenges parsing ability is the garden-path sentence. These sentences are designed so that the most common interpretation of the sentence appears grammatically faulty, but upon further inspection, these sentences are grammatically sound. Garden-path sentences are difficult to parse because they contain a phrase or a word with more than one meaning, often their most typical meaning being a different part of speech.[11] For example, in the sentence, "the horse raced past the barn fell", raced is initially interpreted as a past tense verb, but in this sentence, it functions as part of an adjective phrase.[12] Since parsing is used to identify parts of speech, these sentences challenge the parsing ability of the reader.
Another type of sentence that is difficult to parse is an attachment ambiguity, which includes a phrase that could potentially modify different parts of a sentence, and therefore presents a challenge in identifying syntactic relationships (e.g. "The boy saw the lady with the telescope", in which the ambiguous phrase with the telescope could modify the boy saw or the lady).[11]
A third type of sentence that challenges parsing ability is center embedding, in which phrases are placed in the center of other similarly formed phrases (e.g. "The rat the cat the man hit chased ran into the trap"). Sentences with two or, in the most extreme cases, three center embeddings are challenging for mental parsing, again because of ambiguity of syntactic relationship.[13]
Within neurolinguistics there are multiple theories that aim to describe how parsing takes place in the brain. One such model is a more traditional generative model of sentence processing, which theorizes that within the brain there is a distinct module designed for sentence parsing, which is preceded by access to lexical recognition and retrieval, and then followed by syntactic processing that considers a single syntactic result of the parsing, only returning to revise that syntactic interpretation if a potential problem is detected.[14] The opposing, more contemporary model theorizes that within the mind, the processing of a sentence is not modular and does not happen in strict sequence. Rather, it posits that several different syntactic possibilities can be considered at the same time, because lexical access, syntactic processing, and determination of meaning occur in parallel in the brain. In this way these processes are integrated.[15]
Although there is still much to learn about the neurology of parsing, studies have shown evidence that several areas of the brain might play a role in parsing. These include the left anterior temporal pole, the left inferior frontal gyrus, the left superior temporal gyrus, the left superior frontal gyrus, the right posterior cingulate cortex, and the left angular gyrus. Although it has not been absolutely proven, it has been suggested that these different structures might favor either phrase-structure parsing or dependency-structure parsing, meaning different types of parsing could be processed in different ways which have yet to be understood.[16]
Discourse analysis
Discourse analysis examines ways to analyze language use and semiotic events. Persuasive language may be called rhetoric.
Computer languages
Parser
A parser is a software component that takes input data (typically text) and builds a
The input to a parser is typically text in some
The use of parsers varies by input. In the case of data languages, a parser is often found as the file reading facility of a program, such as reading in HTML or
The implied disadvantages of a one-pass compiler can largely be overcome by adding fix-ups, where provision is made for code relocation during the forward pass, and the fix-ups are applied backwards when the current program segment has been recognized as having been completed. An example where such a fix-up mechanism would be useful would be a forward GOTO statement, where the target of the GOTO is unknown until the program segment is completed. In this case, the application of the fix-up would be delayed until the target of the GOTO was recognized. Conversely, a backward GOTO does not require a fix-up, as the location will already be known.
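The fix-up mechanism described above can be illustrated with a minimal sketch. The instruction format (pairs such as ("goto", "end") and ("label", "end")) and the function name assemble are assumptions made for this illustration; they do not correspond to any particular compiler. A forward GOTO is emitted with a placeholder target and patched once its label is seen, while a backward GOTO is resolved immediately.

```python
def assemble(program):
    """One-pass translation of a toy program, using fix-ups for forward GOTOs.

    `program` is a list of (kind, argument) pairs; this format is hypothetical.
    """
    code = []      # emitted instructions
    labels = {}    # label name -> address of the next instruction
    fixups = {}    # label name -> indices in `code` still awaiting an address

    for kind, arg in program:
        if kind == "label":
            labels[arg] = len(code)
            # Apply the pending fix-ups backwards now that the target is known.
            for index in fixups.pop(arg, []):
                code[index] = ("goto", labels[arg])
        elif kind == "goto":
            if arg in labels:
                # Backward GOTO: the location is already known.
                code.append(("goto", labels[arg]))
            else:
                # Forward GOTO: emit a placeholder and record a fix-up.
                fixups.setdefault(arg, []).append(len(code))
                code.append(("goto", None))
        else:
            code.append((kind, arg))

    if fixups:
        raise ValueError(f"undefined labels: {sorted(fixups)}")
    return code
```

For the input [("label", "start"), ("goto", "end"), ("op", "a"), ("goto", "start"), ("label", "end"), ("op", "b")], the forward jump to end is first emitted as ("goto", None) and later patched to ("goto", 3).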
Context-free grammars are limited in the extent to which they can express all of the requirements of a language. Informally, the reason is that the memory of such a language is limited. The grammar cannot remember the presence of a construct over an arbitrarily long input; this is necessary for a language in which, for example, a name must be declared before it may be referenced. More powerful grammars that can express this constraint, however, cannot be parsed efficiently. Thus, it is a common strategy to create a relaxed parser for a context-free grammar which accepts a superset of the desired language constructs (that is, it accepts some invalid constructs); later, the unwanted constructs can be filtered out at the semantic analysis (contextual analysis) step.
For example, in Python the following is syntactically valid code:
x = 1
print(x)
The following code, however, is syntactically valid in terms of the context-free grammar, yielding a syntax tree with the same structure as the previous, but violates the semantic rule requiring variables to be initialized before use:
x = 1
print(y)
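The two-step strategy (a relaxed context-free parse followed by a semantic filter) can be sketched with Python's standard ast module. The function name undefined_names and its straight-line, top-to-bottom notion of "defined before use" are simplifying assumptions for this illustration; the checks performed by a real compiler or linter are considerably more involved.

```python
import ast
import builtins


def undefined_names(source: str) -> list[str]:
    """Return names that are read before any top-level assignment to them."""
    tree = ast.parse(source)        # syntactic step: raises SyntaxError only
    defined = set(dir(builtins))    # built-in names such as print count as defined
    problems = []
    for statement in tree.body:     # semantic step: walk statements in order
        names = [n for n in ast.walk(statement) if isinstance(n, ast.Name)]
        for node in names:
            if isinstance(node.ctx, ast.Load) and node.id not in defined:
                problems.append(node.id)
        for node in names:
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
    return problems
```

Both snippets above pass ast.parse without error. Under these assumptions the first yields an empty list, while the second yields ["y"], which is the kind of construct the semantic analysis step is expected to reject.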
Overview of process
The following example demonstrates the common case of parsing a computer language with two levels of grammar: lexical and syntactic.
The first stage is the token generation, or lexical analysis, by which the input character stream is split into meaningful symbols defined by a grammar of regular expressions. For example, a calculator program would look at an input such as "12 * (3 + 4)^2" and split it into the tokens 12, *, (, 3, +, 4, ), ^, 2, each of which is a meaningful symbol in the context of an arithmetic expression. The lexer would contain rules to tell it that the characters *, +, ^, ( and ) mark the start of a new token, so meaningless tokens like "12*" or "(3" will not be generated.
The next stage is parsing or syntactic analysis, which is checking that the tokens form an allowable expression. This is usually done with reference to a context-free grammar which recursively defines components that can make up an expression and the order in which they must appear. However, not all rules defining programming languages can be expressed by context-free grammars alone, for example type validity and proper declaration of identifiers. These rules can be formally expressed with attribute grammars.
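The first two stages can be sketched for the calculator example as follows. This is an illustrative sketch, not a reference implementation: the names tokenize and parse, the tuple-based tree representation, and the particular grammar (with ^ binding tighter than *, * tighter than +, and ^ right-associative) are assumptions chosen for the example. The lexer uses a regular expression, and the parser is a small recursive descent parser over a context-free grammar.

```python
import re

TOKEN_PATTERN = re.compile(r"\d+|[*+^()]")


def tokenize(text):
    """Lexical analysis: split the character stream into tokens."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        match = TOKEN_PATTERN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def parse(tokens):
    """Syntactic analysis for the assumed grammar:
    expr -> term ('+' term)*      term -> power ('*' power)*
    power -> atom ('^' power)?    atom -> number | '(' expr ')'
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():
        node = power()
        while peek() == "*":
            advance()
            node = ("*", node, power())
        return node

    def power():
        base = atom()
        if peek() == "^":
            advance()
            return ("^", base, power())   # right-associative
        return base

    def atom():
        token = advance()
        if token is None:
            raise SyntaxError("unexpected end of input")
        if token.isdigit():
            return int(token)
        if token == "(":
            node = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        raise SyntaxError(f"unexpected token {token!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree
```

With these definitions, tokenize("12 * (3 + 4)^2") gives ['12', '*', '(', '3', '+', '4', ')', '^', '2'], and passing that list to parse gives the tree ('*', 12, ('^', ('+', 3, 4), 2)). An input such as "12 * )" is tokenized successfully but rejected by the parser, which is the division of labour between the two stages.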
The final phase is
Types of parsers
The task of the parser is essentially to determine if and how the input can be derived from the start symbol of the grammar. This can be done in two main ways:
- Top-down parsing: Top-down parsing can be viewed as an attempt to find left-most derivations of an input-stream by searching for parse trees using a top-down expansion of the given formal grammar rules. Tokens are consumed from left to right. Inclusive choice is used to accommodate ambiguity by expanding all alternative right-hand-sides of grammar rules.[18] This is known as the primordial soup approach. Very similar to sentence diagramming, primordial soup breaks down the constituencies of sentences.[19]
- Bottom-up parsing: A parser can start with the input and attempt to rewrite it to the start symbol. Intuitively, the parser attempts to locate the most basic elements, then the elements containing these, and so on. LR parsers are examples of bottom-up parsers. Another term used for this type of parser is shift-reduce parsing.
An important distinction with regard to parsers is whether a parser generates a leftmost derivation or a rightmost derivation (see context-free grammar). LL parsers will generate a leftmost derivation and LR parsers will generate a rightmost derivation (although usually in reverse).[18]
Some graphical parsing algorithms have been designed for
Implementation
A simple parser implementation reads the entire input file, performs an intermediate computation or translation, and then writes the entire output file, as in-memory multi-pass compilers do.
Alternative parser implementation approaches:
- push parsers call registered handlers (callbacks) as soon as the parser detects relevant tokens in the input stream. A push parser may skip parts of the input that are irrelevant (an example is Expat).
- pull parsers, such as the parsers typically used by compiler front-ends, which work by "pulling" input text.
- incremental parsers (such as incremental chart parsers) that, as the text of the file is edited by a user, do not need to completely re-parse the entire file.
- Active versus passive parsers[26][27]
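The difference between the push and pull styles can be illustrated with two parsers from the Python standard library: xml.parsers.expat (a binding to Expat, mentioned above) and xml.etree.ElementTree.XMLPullParser. The sample document and the helper names push_parse and pull_parse are assumptions made for this sketch. In the push style the parser calls a registered handler; in the pull style the calling code asks the parser for the next events.

```python
import xml.parsers.expat
from xml.etree.ElementTree import XMLPullParser

DOCUMENT = "<root><item/><item/></root>"   # sample input for illustration


def push_parse(text):
    """Push style: the parser invokes our callback for each start tag."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    parser.Parse(text, True)               # True marks the final chunk
    return seen


def pull_parse(text):
    """Pull style: the caller feeds data, then retrieves events itself."""
    parser = XMLPullParser(events=("start",))
    parser.feed(text)
    return [element.tag for _event, element in parser.read_events()]
```

Both functions report the same element names for DOCUMENT; what differs is which side drives the control flow.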
Parser development software
Some of the well known parser development tools include the following:
Lookahead
Lookahead establishes the maximum number of incoming tokens that a parser can use to decide which rule it should use. Lookahead is especially relevant to LL, LR, and LALR parsers, where it is often explicitly indicated by affixing the lookahead to the algorithm name in parentheses, such as LALR(1).
Most
LR parsers typically have only a few actions after seeing each token. They are shift (add this token to the stack for later reduction), reduce (pop tokens from the stack and form a syntactic construct), end, error (no known rule applies) or conflict (does not know whether to shift or reduce).
Lookahead has two advantages.[clarification needed]
- It helps the parser take the correct action in case of conflicts. For example, parsing the if statement in the case of an else clause.
- It eliminates many duplicate states and eases the burden of an extra stack. A C language non-lookahead parser will have around 10,000 states. A lookahead parser will have around 300 states.
Example: Parsing the Expression 1 + 2 * 3[dubious]
The set of expression parsing rules (called the grammar) is as follows:
- Rule1: E → E + E (expression is the sum of two expressions)
- Rule2: E → E * E (expression is the product of two expressions)
- Rule3: E → number (expression is a simple number)
- Rule4: + has less precedence than *
Most programming languages (except for a few such as APL and Smalltalk) and algebraic formulas give higher precedence to multiplication than addition, in which case the correct interpretation of the example above is 1 + (2 * 3). Note that Rule4 above is a semantic rule. It is possible to rewrite the grammar to incorporate this into the syntax. However, not all such rules can be translated into syntax.
- Simple non-lookahead parser actions
Initially Input = [1, +, 2, *, 3]
- Shift "1" onto stack from input (in anticipation of rule3). Input = [+, 2, *, 3] Stack = [1]
- Reduce "1" to expression "E" based on rule3. Stack = [E]
- Shift "+" onto stack from input (in anticipation of rule1). Input = [2, *, 3] Stack = [E, +]
- Shift "2" onto stack from input (in anticipation of rule3). Input = [*, 3] Stack = [E, +, 2]
- Reduce stack element "2" to Expression "E" based on rule3. Stack = [E, +, E]
- Reduce stack items [E, +, E] to "E" based on rule1. Stack = [E]
- Shift "*" onto stack from input (in anticipation of rule2). Input = [3] Stack = [E,*]
- Shift "3" onto stack from input (in anticipation of rule3). Input = [] (empty) Stack = [E, *, 3]
- Reduce stack element "3" to expression "E" based on rule3. Stack = [E, *, E]
- Reduce stack items [E, *, E] to "E" based on rule2. Stack = [E]
The parse tree and the code resulting from it are not correct according to language semantics: the input has been grouped as (1 + 2) * 3.
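The non-lookahead steps listed above can be reproduced with a minimal sketch. The function name parse_without_lookahead and the representation of a reduced expression E as an int or a nested tuple are assumptions made for illustration; the parser shifts one token at a time and reduces as soon as any of rule1 to rule3 matches the top of the stack.

```python
def is_expr(item):
    """A reduced expression E is an int (rule3) or an (op, left, right) tuple."""
    return isinstance(item, (int, tuple))


def parse_without_lookahead(tokens):
    stack = []
    for token in tokens:
        stack.append(token)                       # shift
        while True:                               # reduce greedily
            top = stack[-1]
            if isinstance(top, str) and top.isdigit():
                stack[-1] = int(top)              # rule3: E -> number
            elif (len(stack) >= 3 and is_expr(stack[-1])
                    and stack[-2] in ("+", "*") and is_expr(stack[-3])):
                right = stack.pop()
                op = stack.pop()
                left = stack.pop()
                stack.append((op, left, right))   # rule1 or rule2
            else:
                break
    if len(stack) != 1 or not is_expr(stack[0]):
        raise SyntaxError("input is not a valid expression")
    return stack[0]
```

Calling parse_without_lookahead(["1", "+", "2", "*", "3"]) returns ('*', ('+', 1, 2), 3): the addition is reduced before the multiplication is ever seen, which is the incorrect grouping described above.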
To correctly parse without lookahead, there are three solutions:
- The user has to enclose expressions within parentheses. This often is not a viable solution.
- The parser needs to have more logic to backtrack and retry whenever a rule is violated or not complete. A similar method is followed in LL parsers.
- Alternatively, the parser or grammar needs to have extra logic to delay reduction and reduce only when it is absolutely sure which rule to reduce first. This method is used in LR parsers. This correctly parses the expression but with many more states and increased stack depth.
- Lookahead parser actions[clarification needed]
- Shift 1 onto stack on input 1 in anticipation of rule3. It does not reduce immediately.
- Reduce stack item 1 to simple Expression on input + based on rule3. The lookahead is +, so we are on path to E +, so we can reduce the stack to E.
- Shift + onto stack on input + in anticipation of rule1.
- Shift 2 onto stack on input 2 in anticipation of rule3.
- Reduce stack item 2 to Expression on input * based on rule3. The lookahead * expects only E before it.
- Now stack has E + E and still the input is *. It has two choices now, either to shift based on rule2 or reduction based on rule1. Since * has higher precedence than + based on rule4, we shift * onto stack in anticipation of rule2.
- Shift 3 onto stack on input 3 in anticipation of rule3.
- Reduce stack item 3 to Expression after seeing end of input based on rule3.
- Reduce stack items E * E to E based on rule2.
- Reduce stack items E + E to E based on rule1.
The parse tree generated is correct and simply more efficient[clarify][citation needed] than non-lookahead parsers. This is the strategy followed in LALR parsers.
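The lookahead steps above can be sketched in the same style. The function name parse_with_lookahead, the PRECEDENCE table (which encodes rule4) and the tuple-based tree are assumptions made for illustration, and the sketch uses explicit precedence comparison rather than the generated state tables of a real LALR parser. Before reducing E op E, the parser inspects the next input token and shifts instead if that token is an operator of higher precedence.

```python
PRECEDENCE = {"+": 1, "*": 2}   # rule4: + has less precedence than *


def is_expr(item):
    """A reduced expression E is an int (rule3) or an (op, left, right) tuple."""
    return isinstance(item, (int, tuple))


def parse_with_lookahead(tokens):
    stack, pos = [], 0
    while True:
        lookahead = tokens[pos] if pos < len(tokens) else None
        top = stack[-1] if stack else None
        if isinstance(top, str) and top.isdigit():
            stack[-1] = int(top)                       # rule3: E -> number
        elif (len(stack) >= 3 and is_expr(stack[-1])
                and stack[-2] in PRECEDENCE and is_expr(stack[-3])
                and (lookahead not in PRECEDENCE
                     or PRECEDENCE[lookahead] <= PRECEDENCE[stack[-2]])):
            right = stack.pop()
            op = stack.pop()
            left = stack.pop()
            stack.append((op, left, right))            # rule1 or rule2
        elif lookahead is not None:
            stack.append(lookahead)                    # shift
            pos += 1
        else:
            break                                      # end of input
    if len(stack) != 1 or not is_expr(stack[0]):
        raise SyntaxError("input is not a valid expression")
    return stack[0]
```

Calling parse_with_lookahead(["1", "+", "2", "*", "3"]) returns ('+', 1, ('*', 2, 3)). When the stack holds E + E and the lookahead is *, the reduction is postponed and * is shifted, as in the steps listed above.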
List of parsing algorithms
- CYK algorithm: an O(n³) algorithm for parsing context-free grammars in Chomsky normal form
- Earley parser: another O(n³) algorithm for parsing any context-free grammar
- linear time and O(n³) in worst case.
- Inside-outside algorithm: an O(n³) algorithm for re-estimating production probabilities in probabilistic context-free grammars
- linear time parsing algorithm for a limited class of context-free grammars
- linear time parsing algorithm for a larger class of context-free grammars. Variants:
- Canonical LR parser
- LALR (look-ahead LR) parser
- Operator-precedence parser
- SLR (Simple LR) parser
- Simple precedence parser
- linear time parsing algorithm supporting some context-free grammars and parsing expression grammars
- Recursive descent parser: a top-down parser suitable for LL(k) grammars
- Shunting-yard algorithm: converts an infix-notation math expression to postfix
- Pratt parser
- Lexical analysis
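As an illustration of one entry in the list, the shunting-yard algorithm can be sketched as follows. The function name to_postfix, the operator table OPERATORS and the restriction to non-negative integer operands are assumptions made for this sketch. Operators wait on a stack and are moved to the output according to precedence and associativity.

```python
# operator -> (precedence, associativity); the values are assumed for the sketch
OPERATORS = {"+": (1, "left"), "*": (2, "left"), "^": (3, "right")}


def to_postfix(tokens):
    """Convert a list of infix tokens to postfix (reverse Polish) order."""
    output, stack = [], []
    for token in tokens:
        if token.isdigit():
            output.append(token)
        elif token in OPERATORS:
            precedence, associativity = OPERATORS[token]
            # Pop operators that bind at least as tightly as the new one.
            while stack and stack[-1] in OPERATORS:
                top_precedence = OPERATORS[stack[-1]][0]
                if top_precedence > precedence or (
                        top_precedence == precedence and associativity == "left"):
                    output.append(stack.pop())
                else:
                    break
            stack.append(token)
        elif token == "(":
            stack.append(token)
        elif token == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise SyntaxError("unbalanced ')'")
            stack.pop()                      # discard the matching "("
        else:
            raise SyntaxError(f"unexpected token {token!r}")
    while stack:
        operator = stack.pop()
        if operator == "(":
            raise SyntaxError("unbalanced '('")
        output.append(operator)
    return output
```

For example, to_postfix("12 * ( 3 + 4 ) ^ 2".split()) returns ['12', '3', '4', '+', '2', '^', '*'].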
See also
References
- ^ a b "Parse". dictionary.reference.com. Retrieved 27 November 2010.
- ISBN 978-1-4615-4034-2.
- ^ "Grammar and Composition". Archived from the original on 2016-12-01. Retrieved 2012-11-24.
- ISBN 978-0-262-13360-9.
- .
- ^ Klein, Dan, and Christopher D. Manning. "Accurate unlexicalized parsing." Proceedings of the 41st Annual Meeting on Association for Computational Linguistics-Volume 1. Association for Computational Linguistics, 2003.
- ^ Charniak, Eugene. "A maximum-entropy-inspired parser." Proceedings of the 1st North American chapter of the Association for Computational Linguistics conference. Association for Computational Linguistics, 2000.
- ^ Chen, Danqi, and Christopher Manning. "A fast and accurate dependency parser using neural networks." Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 2014.
- arXiv:1606.03622 [cs.CL].
- ^ Sandra H. Vos, Thomas C. Gunter, Herbert Schriefers & Angela D. Friederici (2001) Syntactic parsing and working memory: The effects of syntactic complexity, reading span, and concurrent load, Language and Cognitive Processes, 16:1, 65-103, DOI: 10.1080/01690960042000085
- ^ a b Pritchett, B. L. (1988). Garden Path Phenomena and the Grammatical Basis of Language Processing. Language, 64(3), 539–576. https://doi.org/10.2307/414532
- OCLC 43300456.
- ^ Karlsson, F. (2010). Working Memory Constraints on Multiple Center-Embedding. Proceedings of the Annual Meeting of the Cognitive Science Society, 32. Retrieved from https://escholarship.org/uc/item/4j00v1j2
- ^ Ferreira, F., & Clifton, C. (1986). The independence of syntactic processing. Journal of Memory and Language, 25(3), 348–368. https://doi.org/10.1016/0749-596X(86)90006-9
- ^ Atlas, J. D. (1997). On the modularity of sentence processing: semantical generality and the language of thought. Language and Conceptualization, 213–214.
- ^ Lopopolo, Alessandro, van den Bosch, Antal, Petersson, Karl-Magnus, and Roel M. Willems; Distinguishing Syntactic Operations in the Brain: Dependency and Phrase-Structure Parsing. Neurobiology of Language 2021; 2 (1): 152–175. doi: https://doi.org/10.1162/nol_a_00029
- ^ Berant, Jonathan, and Percy Liang. "Semantic parsing via paraphrasing." Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2014.
- ^ Addison-Wesley Longman Publishing Co., Inc. Boston, MA, USA.
- )
- ^ Frost, R., Hafiz, R. and Callaghan, P. (2007) "Modular and Efficient Top-Down Parsing for Ambiguous Left-Recursive Grammars." 10th International Workshop on Parsing Technologies (IWPT), ACL-SIGPARSE, Pages: 109-120, June 2007, Prague.
- ^ Frost, R., Hafiz, R. and Callaghan, P. (2008) "Parser Combinators for Ambiguous Left-Recursive Grammars." 10th International Symposium on Practical Aspects of Declarative Languages (PADL), ACM-SIGPLAN, Volume 4902/2008, Pages: 167-181, January 2008, San Francisco.
- ^ Rekers, Jan, and Andy Schürr. "Defining and parsing visual languages with layered graph grammars." Journal of Visual Languages & Computing 8.1 (1997): 27-55.
- ^ Rekers, Jan, and A. Schurr. "A graph grammar approach to graphical parsing." Visual Languages, Proceedings., 11th IEEE International Symposium on. IEEE, 1995.
- ^ Zhang, Da-Qian, Kang Zhang, and Jiannong Cao. "A context-sensitive graph grammar formalism for the specification of visual languages." The Computer Journal 44.3 (2001): 186-200.
- ISBN 978-1-4615-3622-2.
- ^ Patrick Blackburn and Kristina Striegnitz. "Natural Language Processing Techniques in Prolog".
- ^ Song-Chun Zhu. "Classic Parsing Algorithms".
- ISBN 0131103628. (Appendix A.13 "Grammar", p.193 ff)
Further reading
- Chapman, Nigel P., LR Parsing: Theory and Practice, ISBN 0-521-30413-X
- Grune, Dick; Jacobs, Ceriel J.H., Parsing Techniques - A Practical Guide, ISBN 0-13-651431-6
External links
- The Lemon LALR Parser Generator
- The Stanford Parser
- Turin University Parser: natural language parser for Italian, open source, developed in Common Lisp by Leonardo Lesmo, University of Torino, Italy.
- Short history of parser construction