Java XML Parsing: SAX vs DOM
The general rule of thumb, when processing XML, is to use DOM on document like files (e.g. html, feeds or settings files), and SAX on large files, XML dump files, or data streams, and such. But lately I have been quite curious on how much the performance difference would be between the two.
So I've created a small micro benchmarks; just default normal java XML processing. No fancy settings or anything. One implementation uses DOM, the other SAX. Both try to retrieve the main feed title from an atom feed, and both are implemented quite simplistic. Basically both are a "use the first title you find", for as far as that maps to the respective APIs.
See the resulting graph:
Clearly, on small documents it hardly matters, but as soon as the XML document size increases SAX is a lot more efficient. To process a 1 Mb document, DOM has to allocate about 6 Mb of objects. (Notice though, that, probably, by the time you get to traverse the nodes, some of that 6 Mb has already turned into garbage, so total memory need is actually less then 6 Mb, more on that below.)
The difference is staggering, though somewhat expected: DOM takes twice the time, and 5 times the memory compared to SAX. Basically SAX scales better, as the rule of thumb predicts.
Implementing
Implementing a SAX parser, however, is of course cumbersome compared to a DOM parser. Because with DOM you can randomly walk the document at your own means, and you kan even keep the Document node around to reference whenever you need to.
Contrast that with SAX, where you pass through the document only ones, and you have be done when the document ends. So likely with SAX, you need to build your own set of objects representing the interesting information that was in the XML. The upside is that your own set of objects is easier and more efficient to reference and traverse, then all those DOM Nodes and NodeLists.
Benchmark Implementation
A quick note on how I did the benchmark measurements. Because it is actually quite difficult to get absolute memory usage, and relative cpu times from java. For cpu times I'm just using the user part of *nix time utility. Taking the average of three runs.
For memory figures, I run the test using java -agentlib:hprof=heap=sites <Test>, and sum all the allocated bytes (6th row in java.hprof.txt in SITES BEGIN section). Somewhat similar as if you would count all sizes in calls to malloc (and friends) in c. Notice this says nothing about total memory use at any point in time, maybe allocated objects are also released quickly.
So in total I run every java test 4 times, one time with profiling enabled, using the following bash snippet (on MacOSX):
TCPU=""
for ((i = 0; i < $RUNS; i++)); do
T=`/usr/bin/time -p java $TEST $FILE 2>&1 | awk -F' ' '/user/ { print $2 }' `
TCPU="$TCPU $T"
done
java -agentlib:hprof=heap=sites $TEST $FILE
MEM=`cat java.hprof.txt | awk '/%.*%/ { total += $6 }; END { print total }'`
Download the test files (60Kb).
Last modified: 2007-11-19 20:17 GMT