Extensible Markup Language (XML)
XmlStreamReader - Simple and powerful
Lingfa Yang
-
XML Basic Concepts: Element | Name Convention | Attributes | Empty Element | Nesting and depth | Root | PI | Comments | CDATA | DTD | Entities
-
Introduction to STL XmlStreamReader DOM | SAX | Stream parser ...
-
Capability of STL XmlStreamReader
-
STL XmlStreamReader Diff
-
Great Tolerance upon HTML and beyond
-
Application: Extract HTML text
-
Application: HTML validator
-
Application: XML validator
-
class StreamCacher : public StreamReader
-
Back-and-forth Bi-direction parsing
-
Other XML parsers
-
Related junk
XML Basic Concepts
-
Element
The building block of XML is the element.
Each element has a name and content.
<p>Hello XML</p>
An element is delimited by two tags, called the "start tag" and "end tag" (or "opening tag" and "closing tag").
When a parser reads this element, it produces three tokens:
start element
characters
end element
-
Element's Name Convention
-
Element names must start with either a letter or an underscore;
the rest consists of letters, digits, underscores, dots, or hyphens.
-
Spaces are not allowed in names.
-
The colon ':' is reserved for namespaces.
-
XML tags are case-sensitive, while HTML tags are not.
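The naming rules above can be sketched as a small check. This is only an illustration: it assumes ASCII names and rejects the colon, since a colon only makes sense together with a namespace prefix. isValidName is a hypothetical helper, not part of XmlStreamReader, and the XML specification itself permits more characters than this simplified list.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Checks a name against the simplified rules listed above (ASCII only).
// Hypothetical helper, written for illustration.
bool isValidName(const std::string& name)
{
    if (name.empty()) return false;
    const unsigned char first = static_cast<unsigned char>(name[0]);
    if (!std::isalpha(first) && first != '_') return false;  // letter or underscore first
    for (std::size_t i = 1; i < name.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(name[i]);
        const bool ok = std::isalnum(c) || c == '_' || c == '.' || c == '-';
        if (!ok) return false;  // rejects spaces and ':' among others
    }
    return true;
}
```

With this sketch, isValidName("pPr") and isValidName("_a.b-c1") hold, while "1abc", "a b" and "w:p" are rejected.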
-
Attributes
An element can have zero or more attributes. Each attribute is a name-value pair,
where attribute names follow the same rules as element names.
An attribute value must be enclosed in double or single quotation marks, and either kind of quote may appear unescaped inside the other.
For example, the following two elements are both well-formed:
<param color="r='0' g='128' b='255'" />
<param color='r="0" g="128" b="255"' />
|
-
Empty Element (self-closed element, or no content element)
<img src="image01.jpg"/>
or,
<img src="image01.jpg"></img>
Tokens are "start element" followed by "end element", no "characters" in between.
The following three examples are equivalent in XML:
<p/>
<p />
<p></p>
-
Nesting and depth
Nesting means that an element contains other elements.
Depth is the number of steps from an element up to the root.
-
Root
An XML document must have exactly one root element.
-
Processing instructions (PI)
The first line of an XML document may be:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
A PI has opening and closing delimiters (<? and ?>), a target (name), and data (content).
-
Comments
<!-- some comments -->
-
CDATA (character data - not markup)
<![CDATA[ ... ]]>
-
DTD (Document Type Definition)
<!DOCTYPE ... [ ... ]>
-
Entities escape
| Original | Escaped |
| less than (<) | &lt; |
| greater than (>) | &gt; |
| ampersand (&) | &amp; |
| apostrophe (') | &apos; |
| quote (") | &quot; |
List of XML and HTML character entity references
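As a rough illustration of the escaping table above, the following sketch replaces the five reserved characters with their predefined entities. escapeXml is a hypothetical helper name chosen for this example; it is not part of XmlStreamReader.

```cpp
#include <string>

// Replaces the five reserved characters with their predefined entities.
// Hypothetical helper, written for illustration.
std::string escapeXml(const std::string& raw)
{
    std::string out;
    out.reserve(raw.size());
    for (char c : raw) {
        switch (c) {
        case '<':  out += "&lt;";   break;
        case '>':  out += "&gt;";   break;
        case '&':  out += "&amp;";  break;
        case '\'': out += "&apos;"; break;
        case '"':  out += "&quot;"; break;
        default:   out += c;        break;  // everything else is copied unchanged
        }
    }
    return out;
}
```

For instance, escapeXml("it's > 0") yields it&apos;s &gt; 0.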
-
XML Stream Parser
Most XML parsers are of the DOM, SAX or SAX2 type.
A DOM parser loads the whole tree into memory, so it does not work for big files.
A SAX parser needs no memory that grows with the document and is therefore scalable, but I don't like its passive data pushing and poor structure.
Let's work on a new parser that seeks simplicity, flexibility, efficiency, and better structure.
It is a stream parser with a token-based data-pulling mechanism and no extra memory cost when parsing big XML files.
1) Three popular tokens:
-
START_ELEMENT
-
END_ELEMENT, and
-
CHARACTERS
2) One pulling method in your own driving loop
readNext();
3) One descent parsing pattern:
while (!reader->atEnd()) {
reader->readNext(); // pull data out
if (reader->isEndElement()) {
... // exit conditions
break;
}
else if (reader->isStartElement()) {
... // objects descent handling
}
else if (reader->isCharacters()) {
std::string t = reader->text();
... // element content, or characters in between
}
}
|
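To make the pattern concrete without the real library, here is a toy pull reader with the same shape. It is a minimal sketch under strong assumptions: the input contains only <name>, </name> and plain text (no attributes, comments, CDATA or self-closing tags). MiniReader and textOf are hypothetical names and say nothing about how XmlStreamReader is implemented.

```cpp
#include <cstddef>
#include <string>

// A toy pull reader, written only to illustrate the three-token pattern.
class MiniReader {
public:
    enum Token { NONE, START_ELEMENT, END_ELEMENT, CHARACTERS };

    explicit MiniReader(const std::string& xml) : m_xml(xml) {}

    bool atEnd() const { return m_pos >= m_xml.size(); }

    // Pulls the next token; the caller decides when to ask for more.
    Token readNext()
    {
        m_text.clear();
        m_token = NONE;
        if (atEnd()) return m_token;
        if (m_xml[m_pos] == '<') {
            std::size_t close = m_xml.find('>', m_pos);
            if (close == std::string::npos) close = m_xml.size();
            const bool isEnd = m_pos + 1 < m_xml.size() && m_xml[m_pos + 1] == '/';
            const std::size_t from = m_pos + (isEnd ? 2 : 1);
            m_text = m_xml.substr(from, close - from);        // tag name
            m_pos = close < m_xml.size() ? close + 1 : close;
            m_token = isEnd ? END_ELEMENT : START_ELEMENT;
            return m_token;
        }
        std::size_t open = m_xml.find('<', m_pos);
        if (open == std::string::npos) open = m_xml.size();
        m_text = m_xml.substr(m_pos, open - m_pos);           // text up to the next tag
        m_pos = open;
        m_token = CHARACTERS;
        return m_token;
    }

    bool isStartElement() const { return m_token == START_ELEMENT; }
    bool isEndElement() const { return m_token == END_ELEMENT; }
    bool isCharacters() const { return m_token == CHARACTERS; }
    const std::string& name() const { return m_text; }  // valid on element tokens
    const std::string& text() const { return m_text; }  // valid on CHARACTERS

private:
    std::string m_xml;
    std::size_t m_pos = 0;
    Token m_token = NONE;
    std::string m_text;
};

// The driving loop: gathers all text inside the first element called `tag`.
std::string textOf(const std::string& xml, const std::string& tag)
{
    MiniReader reader(xml);
    std::string out;
    bool inside = false;
    while (!reader.atEnd()) {
        reader.readNext();                               // pull data out
        if (reader.isEndElement()) {
            if (inside && reader.name() == tag) break;   // exit condition
        }
        else if (reader.isStartElement()) {
            if (reader.name() == tag) inside = true;     // descent handling
        }
        else if (reader.isCharacters()) {
            if (inside) out += reader.text();            // element content
        }
    }
    return out;
}
```

Calling textOf("<doc><p>Hello <i>XML</i></p><p>more</p></doc>", "p") returns "Hello XML": the loop stops at the first </p> and never looks at the rest of the input.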
From these three points you can imagine how simple the stream parser is!
A big advantage of the stream parser is that it lets you build a recursive-descent, object-oriented parsing process.
For example, in Office Open XML, a paragraph (p) contains a property list (pPr) and several runs;
each run (r) contains one run property list (rPr), text (t), or inline-level objects.
<p>
<pPr>
...
</pPr>
<r>
<rPr>
...
</rPr>
<t> ... </t>
</r>
<r>
<pict>
...
<txtbox>
<p>
...
</p>
</txtbox>
</pict>
</r>
</p>
To handle a paragraph we have:
void Xxx::readParagraph()
{
while (!reader->atEnd()) {
reader->readNext();
if (reader->isEndElement()) {
if (reader->name() == "p") break;
}
else if (reader->isStartElement()) {
if (reader->name() == "pPr") {
readParagraphProperty();
}
else if (reader->name() == "r" ) {
readRun();
}
}
}
}
void Xxx::readRun()
{
while (!reader->atEnd()) {
reader->readNext();
if (reader->isEndElement()) {
if (reader->name() == "r") break;
}
else if (reader->isStartElement()) {
if (reader->name() == "rPr") {
readRunProperty();
}
else if (reader->name() == "t" ) {
readText();
}
else if (reader->name() == "pict") {
readPicture();
}
}
}
}
-
Capability of XmlStreamReader
-
Report tag name and attributes at
isStartElement()
if (rdr->isStartElement()) {
fstring name = rdr->name();
fstring prefix = rdr->prefix();
PairList a = rdr->attrs();
}
|
-
Report tag name at
isEndElement()
if (rdr->isEndElement()) {
fstring name = rdr->name();
fstring prefix = rdr->prefix();
}
|
-
Report text at
isCharacters(). Also, answers if the text isWhitespace().
if (rdr->isCharacters()) {
if (!rdr->isWhitespace()) {
fwstring text = rdr->text();
}
}
|
-
Report pi target and data.
if (rdr->isProcessingInstruction()) {
fstring target = rdr->name();
fwstring data = rdr->text();
}
|
-
Report comment, DTD, and CDATA by text()
if (rdr->isComment()
|| rdr->isDTD()
|| rdr->isCDATA() ) {
fwstring text = rdr->text();
}
|
-
Report Error:
if (rdr->hasError()) {
fstring errs = rdr->errstring();
}
|
-
Report open tag count and depth of nesting (distance to root)
int depth = rdr->depth();
int tagCount = rdr->tagCount();
|
-
Line and column numbers
while (!rdr->atEnd()) {
rdr->readNext();
size_t lineNumber = rdr->lineNumber();
size_t columnNumber = rdr->columnNumber();
}
|
For example,
<?xml version='1.0' encoding='utf-8' standalone='yes'?>
<body><p>
<!-- some comment -->
This is <i>something</i>.
</p>
</body>
|
Each token is reported with the (line, column) position at which it ends; line numbers start from 1 and column numbers start from 0.
-
Character offset
Besides line and column numbers, the character offset is known at every token as well.
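A line/column pair and a character offset describe the same position. The sketch below converts one into the other, assuming that '\n' is the only line break, that the offset is counted from 0, and that lines start at 1 and columns at 0 as stated above. lineColumnAt is a hypothetical helper, not a member of XmlStreamReader.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Returns the (line, column) of the character at `offset`.
// Lines are counted from 1, columns from 0; hypothetical helper.
std::pair<std::size_t, std::size_t>
lineColumnAt(const std::string& text, std::size_t offset)
{
    std::size_t line = 1, column = 0;
    for (std::size_t i = 0; i < offset && i < text.size(); ++i) {
        if (text[i] == '\n') { ++line; column = 0; }  // a line feed starts a new line
        else                 { ++column; }
    }
    return {line, column};
}
```

For the text "ab\ncd", offset 1 maps to (1, 1) and offset 3 maps to (2, 0).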
-
Spaces handling
-
Reported space(s):
Any spaces between a closing angle bracket '>' and the next opening angle bracket '<', including line feeds (line breaks, or end-of-line / EOL characters), are reported by
text()
-
Reduced/abandoned space(s) between an opening angle bracket '<' and a closing angle bracket '>':
-
All spaces after a tag name and before the first attribute name are reduced to 1 space.
-
All spaces between attributes are reduced to 1 space.
-
Any spaces around '=', before or after, are abandoned.
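The three reduction rules can be sketched as one pass over the text between '<' and '>'. This is an illustration only: normalizeTag is a hypothetical function, and it assumes that spaces inside quoted attribute values must be left untouched, which the rules above do not state explicitly.

```cpp
#include <cctype>
#include <string>

// Normalizes the spaces of the text found between '<' and '>'.
// Hypothetical helper; quoted attribute values are copied verbatim.
std::string normalizeTag(const std::string& inner)
{
    std::string out;
    char quote = 0;             // active quote character, 0 outside a value
    bool pendingSpace = false;  // a run of spaces was seen but not yet written
    for (char c : inner) {
        if (quote) {                        // inside a quoted value
            out += c;
            if (c == quote) quote = 0;
            continue;
        }
        if (std::isspace(static_cast<unsigned char>(c))) {
            pendingSpace = true;
            continue;
        }
        if (c == '=') {                     // spaces before '=' are abandoned
            pendingSpace = false;
            out += c;
            continue;
        }
        // spaces after '=' are abandoned; any other run becomes one space
        if (pendingSpace && !out.empty() && out.back() != '=') out += ' ';
        pendingSpace = false;
        if (c == '"' || c == '\'') quote = c;
        out += c;
    }
    return out;
}
```

For example, the input p   class  =  "a  b"   id='x' comes out as p class="a  b" id='x'.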
-
Public Member Functions
-
UTF-8 Encoding:
XmlStreamReader has a Source to read.
FileSource is a Source.
When opening a file, one can specify the encoding (the ccs flag is an extension of the Microsoft C runtime):
fopen_s(&fp, fileName.c_str(), "r, ccs=UTF-8");
|
Read wide char:
bool FileSource::read(uint16 &ch)
{
if (!fp) return false;
wint_t tmp = ::getwc(fp); // getwc() returns wint_t, which can hold WEOF
if (tmp == WEOF) { // end of file or read error
ch = 0;
return false;
}
ch = static_cast<uint16>(tmp);
return true;
}
Debug view:
Tree view:
Notepad++ displays the file well in "Encoded in UTF-8 without BOM" mode.
-
STL XmlStreamReader diff
-
Namespace declaration report
Our STL XmlStreamReader reports namespace declarations in a different way. For example,
<document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" />
|
XmlStreamReader treats the namespace declaration as nothing more than an attribute name-value pair, where
-
the prefix is "xmlns",
-
the local name is "w",
-
the qualified name is "xmlns:w",
-
the value is "http://...".
if (rdr->isStartElement()) {
if (rdr->hasAttrs()) {
StringPair pair = rdr->attrs().at(0);
fwstring prefix = pair.prefix(); // "xmlns"
fwstring name = pair.localName();// "w"
fwstring value = pair.second; // "http:..."
}
}
|
With QXmlStreamReader:
if (rdr.isStartElement()) {
QString tagName = rdr.name().toString();
QXmlStreamAttributes a = rdr.attributes();
int n = a.size();
QXmlStreamNamespaceDeclarations namespaces = rdr.namespaceDeclarations();
if (!namespaces.isEmpty()) {
// The QXmlStreamNamespaceDeclarations type is defined to be
// a QVector of QXmlStreamNamespaceDeclaration.
QXmlStreamNamespaceDeclaration ns = namespaces[0];
QString uri = ns.namespaceUri().toString(); // "http:..."
QString prefix = ns.prefix().toString(); // "w"
}
}
|
The qdoc says: "If the state() is StartElement, this function returns the element's namespace declarations. Otherwise an empty vector is returned."
The problem is how to know that a namespace declaration is present. It is impractical to fetch a possible list of declarations for every start element and then check whether the vector is empty.
-
Valid/Invalid with/without namespace declaration
For example,
<!-- The following prefix is not declared -->
<w:p/>
|
-
It is invalid in QXmlStreamReader: "Namespace prefix 'w' not declared"
-
It is invalid in Xerces: "The prefix 'w' has not been mapped to any URI at row:2 col:7"
-
It is valid in QDOM, and it is also valid in our STL XmlStreamReader. We do not stop parsing. Let it go!
-
What happens if an element has a repeated attribute?
-
STL XmlStreamReader has no validation on repeated attribute(s). It uses a list for attributes.
-
QDOM uses a map, so it keeps only the latest value.
-
QXmlStreamReader Error: "Attribute redefined."
-
Xerces stops parsing on such an error.
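The difference between the list and map behaviours described above can be shown with standard containers. This is a sketch of the idea, not code from any of the parsers mentioned; Attr, asMap and hasRepeatedName are hypothetical names.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> Attr;  // (name, value)

// Map behaviour: one entry per name, so a later value replaces an earlier one.
std::map<std::string, std::string> asMap(const std::vector<Attr>& attrs)
{
    std::map<std::string, std::string> m;
    for (const Attr& a : attrs) m[a.first] = a.second;
    return m;
}

// Strict behaviour: detect the repetition so the caller can raise an error.
bool hasRepeatedName(const std::vector<Attr>& attrs)
{
    std::set<std::string> seen;
    for (const Attr& a : attrs) {
        if (!seen.insert(a.first).second) return true;  // name was already present
    }
    return false;
}
```

Given the attributes id="1", class="x", id="2" in a vector, the list still holds three entries in document order, asMap() holds two with id mapped to "2", and hasRepeatedName() returns true.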
-
Attribute value restore/escape?
No, STL XmlStreamReader does not restore/escape anything. It's a RAW reader.
For example, this is valid XML in which an attribute value is expressed in different ways.
<params>
<param color="r='0' g='128' b='255'" />
<param color='r="0" g="128" b="255"' />
<param color="r=&quot;0&quot; g=&quot;128&quot; b=&quot;255&quot;" />
<param color='r=&apos;0&apos; g=&apos;128&apos; b=&apos;255&apos;' />
</params>
|
STL XmlStreamReader reports these values exactly as written,
but normal readers, such as QXmlStreamReader and Xerces SAX2, report them in restored form.
if (rdr.isStartElement()) {
QString tagName = rdr.name().toString();
QXmlStreamAttributes attributes = rdr.attributes();
foreach(QXmlStreamAttribute a, attributes) {
QString prefix = a.prefix().toString();
QString name = a.name().toString();
QString value = a.value().toString();
}
}
|
-
Characters Report difference
For example, take an element that looks like this in a text editor:
<p>Italic font example: &lt;i&gt;italic&gt;&lt;/i&gt;</p>
|
STL XmlStreamReader reports the whole element content in one report, but QXmlStreamReader (version 4.4.1) reports it in 6 pieces.
They are:
1: "Italic font example: "
2: "<i"
3: ">italic"
4: ">"
5: "</i"
6: ">"
The text is broken up every time an escaped character is encountered,
and each escaped character is represented in restored form, for example, "&lt;" as "<".
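The splitting behaviour described here can be imitated in a few lines. The sketch below assumes the element content was written with escaped characters (&lt;i&gt;italic&gt;&lt;/i&gt;) and starts a new chunk at every predefined entity, restoring the character as it goes. splitAtEntities is a hypothetical function that only mimics the observed output; it is not how QXmlStreamReader is implemented.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

struct Entity { const char* name; char ch; };

// Decodes a predefined entity starting at raw[pos].
// Returns the restored character and advances pos, or returns 0.
static char decodeEntity(const std::string& raw, std::size_t& pos)
{
    static const Entity table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'},
        {"&apos;", '\''}, {"&quot;", '"'}
    };
    for (const Entity& e : table) {
        const std::size_t len = std::strlen(e.name);
        if (raw.compare(pos, len, e.name) == 0) { pos += len; return e.ch; }
    }
    return 0;
}

// Splits raw element content into chunks, one new chunk per escaped character.
std::vector<std::string> splitAtEntities(const std::string& raw)
{
    std::vector<std::string> chunks;
    std::string current;
    std::size_t i = 0;
    while (i < raw.size()) {
        const char decoded = raw[i] == '&' ? decodeEntity(raw, i) : 0;
        if (decoded) {
            if (!current.empty()) chunks.push_back(current);  // close the running chunk
            current.assign(1, decoded);                       // next chunk starts restored
        } else {
            current += raw[i++];                              // ordinary character
        }
    }
    if (!current.empty()) chunks.push_back(current);
    return chunks;
}
```

Applied to the escaped content of the italic example, the function returns the same six pieces listed above.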
Another example: the element content
He said: &quot;I&apos;ll come again.&quot;
is reported as four CHARACTERS tokens:
"He said: "
""I"
"'ll come again."
"""
-
Great Tolerance upon HTML and beyond
Many HTML files are not well-formed XML, and for this reason many XML parsers fail on them.
My STL XmlStreamReader has no such problem. It can extract content or search smoothly through HTML files.
-
No problem if close tags are missing.
Very often in HTML some tags have open tags but no close tags, for example,
<P>, <LI>, <BR>, <IMG>, <HR> in the following HTML:
<BODY>
<P>The first paragraph has a list:
<UL>
<LI>First item
<LI>Second item
<LI>Last item
</UL>
<P>The second paragraph has <IMG SRC="google.png">image,
line break <BR> and horizontal line <HR>
</BODY>
|
Its token list is:
Token Line Column Offset Text
START_DOCUMENT 1 1 1
START_ELEMENT 1 6 6 BODY
CHARACTERS 2 1 8
START_ELEMENT 2 3 10 P
CHARACTERS 3 3 45 The first paragraph
START_ELEMENT 3 6 48 UL
CHARACTERS 4 5 54
START_ELEMENT 4 8 57 LI
CHARACTERS 5 5 73 First item
START_ELEMENT 5 8 76 LI
CHARACTERS 6 5 93 Second item
START_ELEMENT 6 8 96 LI
CHARACTERS 7 3 111 Last item
END_ELEMENT 7 7 115 UL
CHARACTERS 8 1 117
START_ELEMENT 8 3 119 P
CHARACTERS 8 29 145 The second paragraph
START_ELEMENT 8 50 166 IMG
CHARACTERS 9 12 186 image, line break
START_ELEMENT 9 15 189 BR
CHARACTERS 9 37 211 and horizontal line
START_ELEMENT 9 40 214 HR
CHARACTERS 10 1 216
END_ELEMENT 10 7 222 BODY
END_DOCUMENT 10 7 222
-
Crossed nesting is tolerated, for example, when bold and italic overlap.
<p>Some <b>bold <i>italic</b> text</i></p>
|
Its token list is:
Token Line Column Offset Text
START_DOCUMENT 1 1 1
START_ELEMENT 1 3 3 p
CHARACTERS 1 9 9 Some
START_ELEMENT 1 11 11 b
CHARACTERS 1 17 17 bold
START_ELEMENT 1 19 19 i
CHARACTERS 1 26 26 italic
END_ELEMENT 1 29 29 b
CHARACTERS 1 35 35 text
END_ELEMENT 1 38 38 i
END_ELEMENT 1 42 42 p
END_DOCUMENT 1 42 42
-
No problem if attribute values are not quoted, for example,
<p>Here is an image <img width=100 height=200> with unquoted attributes.
|
Its token list is:
Token Line Column Offset Text
START_DOCUMENT 1 1 1
START_ELEMENT 1 3 3 p
CHARACTERS 1 21 21 Here is an image
START_ELEMENT 1 46 46 img
CHARACTERS 1 73 73 with unquoted attri...
END_ELEMENT 1 76 76 p
END_DOCUMENT 1 76 76
-
No problem if the attribute section contains litter, for example,
<div class="bodysmall" float: left;" sdkjjf dfjsdhkfhsd>
A token inspector demo browses through every token in a list widget (right-hand side). The token string, the token location (line and column numbers and character offset), and the tags or selected text are displayed in the bottom status bar, while marked tokens are shown in a central text edit widget.
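A tolerant reader can still notice crossed nesting, such as the <b> that is closed before the <i> it contains in the example earlier in this section. The sketch below keeps a stack of open elements and reports every end tag that does not close the innermost one. TagEvent and crossedEndTags are hypothetical names, and the recovery step (dropping the matching open tag) is an assumption made for this illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct TagEvent {
    bool start;        // true for a start tag, false for an end tag
    std::string name;
};

// Returns the names of end tags that do not close the innermost open element.
std::vector<std::string> crossedEndTags(const std::vector<TagEvent>& events)
{
    std::vector<std::string> open;     // stack of currently open elements
    std::vector<std::string> crossed;
    for (const TagEvent& e : events) {
        if (e.start) { open.push_back(e.name); continue; }
        if (!open.empty() && open.back() == e.name) {
            open.pop_back();           // properly nested
            continue;
        }
        crossed.push_back(e.name);
        // tolerant recovery: forget the matching open tag, if there is one
        for (std::size_t i = open.size(); i-- > 0; ) {
            if (open[i] == e.name) {
                open.erase(open.begin() + static_cast<std::ptrdiff_t>(i));
                break;
            }
        }
    }
    return crossed;
}
```

For the tag sequence p, b, i, /b, /i, /p the function reports a single crossed end tag, "b".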
-
Back-and-forth Bi-direction parsing
Rivers and streams flow in one direction. Can XmlStreamReader read back and forth in a bidirectional (bidi) manner? Yes.
Here is an example of why we need bidi parsing.
The following XML contains two placeholders in a paragraph. Placeholders are used so that the normal paragraph flow is not interrupted. The real contents are attached in a separate section after the paragraph.
Our goal here is to restore the placeholder contents in place:
The question is: when the reader reaches a placeholder, how can it fetch the real content further ahead and insert it into the current paragraph?
You may use a two-pass solution: collect the information and cache it in memory in the first pass,
and restore it in the second pass. The disadvantages are that parsing twice wastes time and caching the information costs memory.
Is there a better solution that is fast and has no memory cost? Yes.
To make XmlStreamReader a bidi parser, a derived class, XmlBidiReader, provides these methods:
class XmlBidiReader : public XmlStreamReader
{
public:
XmlBidiReader();
enum TokenType readNext();
size_t tokenCount() const {return m_tokenCount;}
bool backTo(size_t tokenCount);
void lock();
void unlock();
};
|
Here is the parsing flow.
The reader reads everything. When a placeholder is met, it calls fetch():
fwstring XmlBidiTest::readText()
{
fwstring id = rdr->attrs()[L"id"];
fwstring displayText;
fwstring talkText;
while (!rdr->atEnd()) {
rdr->readNext();
if (rdr->isEndElement()) {
if (rdr->name() == "t") break;
}
else if (rdr->isCharacters()) {
displayText = rdr->text();
talkText = fetch(id);
}
}
return talkText;
}
|
fetch() creates a locker on entry. When fetch() is done, the locker repositions the reader back to the entry point, and parsing continues with the rest of the paragraph.
fwstring XmlBidiTest::fetch(const fwstring & id)
{
XmlBidiLocker locker(rdr); // return to current token
bool inTalks = false; // inside talks section
while (!rdr->atEnd()) {
rdr->readNext();
if (rdr->isStartElement()) {
if (rdr->name() == "talks") {
inTalks = true;
continue;
}
if (inTalks) {
if (rdr->name() == "t" && rdr->attrs()[L"id"] == id) {
return readContent();
}
}
}
}
return L""; // not found
}
|
Implementation details:
a locker on XmlBidiReader:
class XmlBidiLocker
{
public:
XmlBidiLocker(XmlBidiReader *rdr);
virtual ~XmlBidiLocker();
private:
XmlBidiReader *rdr;
};
|
The locker simply pairs two methods, lock() and unlock(), in its constructor and destructor.
XmlBidiLocker::XmlBidiLocker(XmlBidiReader *rdr)
: rdr(rdr)
{
rdr->lock();
}
XmlBidiLocker::~XmlBidiLocker()
{
rdr->unlock();
}
|
"lock" means pushing the current token count onto a stack,
void XmlBidiReader::lock()
{
locks.push(m_tokenCount);
}
|
and "unlock" means popping it off and repositioning the reader.
void XmlBidiReader::unlock()
{
if (!locks.empty()) {
this->backTo(locks.top());
locks.pop();
}
}
|
Repositioning looks up a map:
/// Back to a previously read token.
bool XmlBidiReader::backTo(size_t n)
{
std::map <size_t, Position> ::const_iterator i = token2pos.find(n);
if (i == token2pos.end()) return false; // not found
m_tokenCount = n;
m_ch = i->second.ch;
if(m_atEnd) m_atEnd = false;
return m_src->jumpTo(i->second.line, i->second.col, i->second.off);
}
|
and the map is built in the one and only token-consuming method, readNext():
enum TokenType XmlBidiReader::readNext()
{
enum TokenType token = XmlStreamReader::readNext();
if (token != INVALID_TOKEN) {
++ m_tokenCount;
token2pos[m_tokenCount] = Position(lineNumber(), columnNumber(), characterOffset(), m_ch);
}
return token;
}
|
This solution is fast, because it parses once, not twice.
This solution is scalable, because there is no data caching: only token positions, never content, are kept in memory for later use.
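The lock/unlock pairing can be tried out in isolation. The sketch below replaces the reader with a cursor over an in-memory vector of tokens, so unlike the real design it does hold all tokens in memory; it only demonstrates that the guard restores the position on every exit path. Cursor, Locker and existsAhead are hypothetical names.

```cpp
#include <cstddef>
#include <stack>
#include <string>
#include <vector>

// A stand-in for the reader: walks a token vector and can be repositioned.
class Cursor {
public:
    explicit Cursor(const std::vector<std::string>& tokens) : m_tokens(tokens) {}
    bool atEnd() const { return m_next >= m_tokens.size(); }
    // Precondition: !atEnd()
    const std::string& readNext() { return m_tokens[m_next++]; }
    std::size_t tokenCount() const { return m_next; }
    void lock() { m_locks.push(m_next); }              // remember where we are
    void unlock()                                      // jump back to that point
    {
        if (!m_locks.empty()) { m_next = m_locks.top(); m_locks.pop(); }
    }
private:
    std::vector<std::string> m_tokens;
    std::size_t m_next = 0;
    std::stack<std::size_t> m_locks;
};

// RAII guard: lock on entry, unlock (and reposition) on every exit path.
class Locker {
public:
    explicit Locker(Cursor& c) : m_cursor(c) { m_cursor.lock(); }
    ~Locker() { m_cursor.unlock(); }
    Locker(const Locker&) = delete;
    Locker& operator=(const Locker&) = delete;
private:
    Cursor& m_cursor;
};

// Looks ahead for `wanted`; the guard restores the position afterwards.
bool existsAhead(Cursor& cursor, const std::string& wanted)
{
    Locker guard(cursor);
    while (!cursor.atEnd()) {
        if (cursor.readNext() == wanted) return true;  // early return is safe
    }
    return false;
}
```

After one readNext() on the tokens a, b, c, d, a call to existsAhead(cursor, "d") returns true and tokenCount() is still 1, so the next readNext() yields "b".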
-
Application: Extract HTML text
/*!
Parse and extract text. Writes to os (the output stream) if a writer is set.
*/
void HtmlTextReader::parse()
{
while (!rdr->atEnd()) {
rdr->readNext();
if (rdr->isStartElement()) {
if (rdr->name() == "script") {
rdr->readRawTillEnd("script");
}
if (needLineBreak(rdr->name())) {
if (os) *os << endl;
}
}
else if (rdr->isCharacters()) {
if (!rdr->isWhitespace()) {
if (os) *os << sfm::trimmed(rdr->text());
}
}
}
if (rdr->hasError()) {
if (os) *os << endl << rdr->errstring();
}
}
|
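The same idea can be tried without the library. The following self-contained sketch copies everything outside tags and skips the raw content of script elements. It is deliberately cruder than the parse() method above: it inserts no line breaks, does not trim whitespace, and treats any tag whose name begins with "script" as a script. extractText and lowered are hypothetical names.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Lower-cases ASCII letters so that <SCRIPT> and <script> compare equal.
static std::string lowered(std::string s)
{
    for (char& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Copies everything outside tags, skipping the raw content of <script>.
std::string extractText(const std::string& html)
{
    const std::string lower = lowered(html);  // same length, used for matching only
    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] != '<') { out += html[pos++]; continue; }
        const std::size_t close = html.find('>', pos);
        if (close == std::string::npos) break;           // unterminated tag: stop
        const bool isScript = lower.compare(pos, 7, "<script") == 0;
        pos = close + 1;
        if (isScript) {                                  // skip raw script content
            const std::size_t end = lower.find("</script", pos);
            if (end == std::string::npos) break;
            const std::size_t endClose = html.find('>', end);
            if (endClose == std::string::npos) break;
            pos = endClose + 1;
        }
    }
    return out;
}
```

For the input <p>Hi <b>there</b><script>var a = '<x>';</script>!</p> the function returns "Hi there!".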
-
Application: HTML validator:
-
Application: XML validator
-
XML tree
If you ask a stream reader for a tree, you get a tree.
if (rdr->isStartElement()) {
if (rdr->name() == "rPr") {
TokenNode *rPr = rdr->tree();
}
}
-
class StreamCacher : public StreamReader
-
A caching reader, which keeps the same flow as the primary StreamReader
and, beyond that, caches tokens internally.
-
One can specify which tags use the cache.
-
One can stop caching with
setNoCache().
-
One can read the cache the same way as a stream reader by taking a reference to a StreamCacher.
StreamCacher r;
StreamReader * reader = new StreamCacher(r);
|
-
One can reset to any position and start reading as a stream.
-
The cache can also be structured as a tree, for example,
TokenNode * root = StreamCacher(r).tree();
|
-
Other XML parsers
Xerces / Xalan / VisualParser?
Why use a cannon, or fire a missile, to hit a fly?
My stream parser is simple, lightweight, fast, and highly portable (implemented with the C++ STL, independent of the Windows/Linux platform),
with the big advantage of the data-pulling approach, which allows you to build recursive descent parsing patterns.
-
Related junk
-
Extensible Stylesheet Language Transformations (XSLT) - It operates on XML sources to produce XSL-FO, other XML, or HTML.
-
XSL-FO (Formatting Objects) - A presentation format usually generated by XSL transformations.
-
XPath - A query language used by XSLT to access the different information items that compose XML documents
-
International Organization for Standardization (ISO) - a non-governmental organization, acts as a consortium with strong links to governments.
-
ISO Schematron - Schematron is a rule-based validation language for making assertions about the presence or absence of patterns in XML trees. It typically uses XPath to describe patterns.
References:
-
XML 1.0 (5th edition, 26 November 2008)
-
QXmlStreamReader
-
Processing XML with Xerces and the DOM
-
Processing XML with Xerces and SAX