Simple API for XML
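SAX (Simple API for XML) is an event-based API for accessing the contents of XML documents, developed by the members of the XML-DEV mailing list as an alternative to DOM.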
Definition
Unlike
Benefits
A SAX parser only needs to report each parsing event as it happens, and normally discards almost all of that information once reported (it does, however, keep some things, for example a list of all elements that have not been closed yet, in order to catch later errors such as end-tags in the wrong order). Thus, the minimum memory required for a SAX parser is proportional to the maximum depth of the XML file (i.e., of the XML tree) and the maximum data involved in a single XML event (such as the name and attributes of a single start-tag, or the content of a processing instruction, etc.).
This much
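The memory behaviour described above can be illustrated with a minimal sketch using Python's standard xml.sax module. The handler keeps nothing except the list of elements that are still open, so its state grows with nesting depth rather than with document length. The names DepthTracker and max_depth are illustrative only.

```python
import xml.sax


class DepthTracker(xml.sax.ContentHandler):
    """Keeps only the stack of currently open element names."""

    def __init__(self):
        super().__init__()
        self.open_elements = []  # names of elements not yet closed
        self.deepest = 0         # largest stack size seen so far

    def startElement(self, name, attrs):
        self.open_elements.append(name)
        self.deepest = max(self.deepest, len(self.open_elements))

    def endElement(self, name):
        # The parser has already checked that end-tags match, so we just pop.
        self.open_elements.pop()


def max_depth(xml_bytes):
    """Return the maximum element nesting depth of an XML document."""
    handler = DepthTracker()
    xml.sax.parseString(xml_bytes, handler)
    return handler.deepest
```

For a document such as `<a><b><c/></b><b/></a>` the stack never holds more than three names, however many sibling elements follow.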
Because SAX is event-driven, processing a document with it is generally far faster than with a DOM-style parser, so long as the processing can be done in a single start-to-end pass. Many tasks, such as indexing, conversion to other formats, and very simple formatting, can be done that way. Other tasks, such as sorting, rearranging sections, following a link to its target, or looking up information on one element to help process a later one, require accessing the document structure in complex orders and will be much faster with DOM than with multiple SAX passes.
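As a sketch of a task that fits a single start-to-end pass, the following converts an XML document to plain text by collecting character data and ignoring all markup. It assumes Python's standard xml.sax module; TextExtractor and extract_text are hypothetical names chosen for this example.

```python
import xml.sax


class TextExtractor(xml.sax.ContentHandler):
    """Collects character data in document order and ignores tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def characters(self, content):
        # May be called several times for one run of text; order is preserved.
        self.parts.append(content)


def extract_text(xml_bytes):
    """Return the concatenated text content of an XML document."""
    handler = TextExtractor()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.parts)
```

Calling `extract_text(b"<p>Hello <b>world</b>!</p>")` returns `"Hello world!"`. No element is ever revisited, which is what makes the task suitable for SAX.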
Some implementations do not neatly fit either category: a DOM approach can keep its
Due to the nature of DOM, streamed reading from disk requires techniques such as lazy evaluation, caches, virtual memory, or persistent data structures (one such technique is disclosed in US patent 5557722). Processing XML documents larger than main memory is sometimes thought impossible because some DOM parsers do not allow it. However, it is no less possible than sorting a dataset larger than main memory, where disk space is used as memory to sidestep the limitation.[4]
Drawbacks
The event-driven model of SAX is useful for XML parsing, but it does have certain drawbacks.
Virtually any kind of
Additionally, some kinds of XML processing simply require having access to the entire document. XSLT and XPath, for example, need to be able to access any node at any time in the parsed XML tree. Editors and browsers likewise need to be able to display, modify, and perhaps re-validate at any time. While a SAX parser may well be used to construct such a tree initially, SAX provides no help for such processing as a whole.
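The point that a SAX parser can be used to construct a tree can be sketched as follows, again assuming Python's standard xml.sax module. Node, TreeBuilder and build_tree are illustrative names, not part of any SAX specification; real DOM implementations carry far more information per node.

```python
import xml.sax


class Node:
    """A very small stand-in for a DOM element node."""

    def __init__(self, name, attrs):
        self.name = name
        self.attrs = attrs
        self.children = []  # child Node objects and text strings, in order


class TreeBuilder(xml.sax.ContentHandler):
    """Builds an in-memory tree from the stream of SAX events."""

    def __init__(self):
        super().__init__()
        self.root = None
        self._stack = []  # path from the root to the element being filled

    def startElement(self, name, attrs):
        node = Node(name, dict(attrs.items()))
        if self._stack:
            self._stack[-1].children.append(node)
        else:
            self.root = node
        self._stack.append(node)

    def endElement(self, name):
        self._stack.pop()

    def characters(self, content):
        if self._stack:
            self._stack[-1].children.append(content)


def build_tree(xml_bytes):
    """Parse a document and return the root Node of the resulting tree."""
    builder = TreeBuilder()
    xml.sax.parseString(xml_bytes, builder)
    return builder.root
```

Once the tree exists, any random-access processing operates on the Node objects; the SAX parser plays no further part.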
XML processing with SAX
A
- XML Text nodes
- XML Element Starts and Ends
- XML Processing Instructions
- XML Comments
Some events correspond to XML objects that are easily returned all at once, such as comments. However, XML elements can contain many other XML objects, and so SAX represents them as does XML itself: by one event at the beginning, and another at the end. Properly speaking, the SAX interface does not deal in elements, but in events that largely correspond to tags. SAX parsing is unidirectional; previously parsed data cannot be re-read without starting the parsing operation again.
There are many SAX-like implementations in existence. In practice, details vary, but the overall model is the same. For example, XML attributes are typically provided as name and value arguments passed to element events, but can also be provided as separate events, or via a hash table or similar collection of all the attributes. As another example, some implementations provide "Init" and "Fin" callbacks for the very start and end of parsing; others do not. The exact names for given event types also vary slightly between implementations.
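Python's standard xml.sax module is one concrete instance of these choices: attributes arrive as a mapping-like collection passed to the element-start callback, and the startDocument and endDocument callbacks play the role of "Init" and "Fin". The sketch below only logs what it receives; LifecycleLogger and trace are hypothetical names.

```python
import xml.sax


class LifecycleLogger(xml.sax.ContentHandler):
    """Logs document boundaries and element starts with their attributes."""

    def __init__(self):
        super().__init__()
        self.log = []

    def startDocument(self):
        # Called once, before any other event.
        self.log.append("startDocument")

    def endDocument(self):
        # Called once, after the last event.
        self.log.append("endDocument")

    def startElement(self, name, attrs):
        # attrs is a collection of all attributes of this start-tag.
        self.log.append(("start", name, sorted(attrs.items())))


def trace(xml_bytes):
    """Return the logged events for an XML document."""
    handler = LifecycleLogger()
    xml.sax.parseString(xml_bytes, handler)
    return handler.log
```

Other implementations may expose the same information under different callback names or as separate attribute events.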
Example
Given the following XML document:
<?xml version="1.0" encoding="UTF-8"?>
<DocumentElement param="value">
<FirstElement>
&#xb6; Some Text
</FirstElement>
<?some_pi some_attr="some_value"?>
<SecondElement param2="something">
Pre-Text <Inline>Inlined text</Inline> Post-text.
</SecondElement>
</DocumentElement>
This XML document, when passed through a SAX parser, will generate a sequence of events like the following:
- XML Element start, named DocumentElement, with an attribute param equal to "value"
- XML Element start, named FirstElement
- XML Text node, with data equal to "¶ Some Text" (note: surrounding whitespace may be reported differently by different parsers)
- XML Element end, named FirstElement
- Processing Instruction event, with the target some_pi and data some_attr="some_value" (the content after the target is just text; however, it is very common to imitate the syntax of XML attributes, as in this example)
- XML Element start, named SecondElement, with an attribute param2 equal to "something"
- XML Text node, with data equal to "Pre-Text"
- XML Element start, named Inline
- XML Text node, with data equal to "Inlined text"
- XML Element end, named Inline
- XML Text node, with data equal to "Post-text."
- XML Element end, named SecondElement
- XML Element end, named DocumentElement
Note that the first line of the sample above is the XML Declaration and not a processing instruction; as such it will not be reported as a processing instruction event (although some SAX implementations provide a separate event just for the XML declaration).
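The event sequence above can be reproduced with a short sketch using Python's standard xml.sax module. EventRecorder and record_events are illustrative names. The indentation in SAMPLE and the use of a numeric character reference for the pilcrow are assumptions made for this example, and whitespace-only text chunks are skipped so that the output resembles the list above.

```python
import xml.sax

SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<DocumentElement param="value">
    <FirstElement>
        &#xb6; Some Text
    </FirstElement>
    <?some_pi some_attr="some_value"?>
    <SecondElement param2="something">
        Pre-Text <Inline>Inlined text</Inline> Post-text.
    </SecondElement>
</DocumentElement>
"""


class EventRecorder(xml.sax.ContentHandler):
    """Records each SAX event as a tuple, in the order it is reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        if content.strip():  # skip whitespace-only chunks for readability
            self.events.append(("text", content))

    def processingInstruction(self, target, data):
        self.events.append(("pi", target, data))


def record_events(xml_bytes):
    """Parse a document and return the list of recorded events."""
    recorder = EventRecorder()
    xml.sax.parseString(xml_bytes, recorder)
    return recorder.events


if __name__ == "__main__":
    for event in record_events(SAMPLE):
        print(event)
```

The element and processing-instruction events appear in the order listed above, and the XML declaration produces no processing-instruction event. The number and boundaries of the text events depend on the parser.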
The result above may vary: the SAX specification deliberately states that a given section of text may be reported as multiple sequential text events. Many parsers, for example, return separate text events for numeric character references. Thus in the example above, a SAX parser may generate a different series of events, part of which might include:
- XML Element start, named FirstElement
- XML Text node, with data equal to "¶" (the Unicode character U+00b6)
- XML Text node, with data equal to " Some Text"
- XML Element end, named FirstElement
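Because a run of text may arrive in several pieces, applications that need the whole string usually buffer character data until the next non-text event. The following is a minimal sketch of that approach with Python's standard xml.sax module; CoalescingHandler and coalesced_events are hypothetical names.

```python
import xml.sax


class CoalescingHandler(xml.sax.ContentHandler):
    """Merges consecutive text events into a single text event."""

    def __init__(self):
        super().__init__()
        self.events = []
        self._buffer = []  # pieces of the text run currently being collected

    def _flush_text(self):
        # Emit the buffered pieces as one text event, then reset the buffer.
        if self._buffer:
            self.events.append(("text", "".join(self._buffer)))
            self._buffer = []

    def characters(self, content):
        self._buffer.append(content)

    def startElement(self, name, attrs):
        self._flush_text()
        self.events.append(("start", name))

    def endElement(self, name):
        self._flush_text()
        self.events.append(("end", name))

    def processingInstruction(self, target, data):
        self._flush_text()
        self.events.append(("pi", target, data))


def coalesced_events(xml_bytes):
    """Parse a document and return events with adjacent text merged."""
    handler = CoalescingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events
```

With this handler, the FirstElement content is reported as one text event regardless of whether the parser splits it at the character reference.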
See also
- Expat (XML)
- Java API for XML Processing
- LibXML
- List of XML markup languages
- List of XML schemas
- MSXML
- RapidJSON - a SAX-like API for JSON
- StAX
- Streaming XML
- VTD-XML
- Xerces
- XQuery API for Java
References
- "SAX". webopedia.com. WEBOPEDIA. Retrieved 2011-05-02.
Short for Simple API for XML, an event-based API that, as an alternative to DOM, allows someone to access the contents of an XML document. SAX was originally a Java-only API. The current version supports several programming language environments other than Java. SAX was developed by the members of the XML-DEV mailing list.
- "saxproject.org".
- "Simple API for XML". oracle.com. ORACLE. Retrieved 2011-05-02.
Note: In a nutshell, SAX is oriented towards state independent processing, where the handling of an element does not depend on the elements that came before. StAX, on the other hand, is oriented towards state dependent processing. For a more detailed comparison, see SAX and StAX in Basic Standards and When to Use SAX.
- "XML Parsers: DOM and SAX Put to the Test". devX. Retrieved 2011-10-20.
Although these tests do not show it, SAX parsers typically are faster for very large documents where the DOM model hits virtual memory or consumes all available memory.
Further reading
- Brownell, David (2002). SAX2. O'Reilly. ISBN 0-596-00237-8.
- Means, W. Scott; Bodie, Michael A. (2002). The Book of SAX. No Starch Press. ISBN 1-886411-77-8.