Extensible Markup Language (XML)

class StreamCacher : public StreamReader

Lingfa Yang

Inspired by

water_bucket_playground_spill_fun


  1. We first make StreamCacher behave exactly the same as StreamReader.

    It must pass the following test:
    bool streamCacher_test()
    {
      fstring fileName = "c:/document.xml";
    
      StreamReader r0;
      StreamCacher r1;
    
      if (!r0.readFile(fileName)) return false;
      if (!r1.readFile(fileName)) return false;
    
      while (!r1.atEnd() && !r0.atEnd()) {
    
        // Read the same types of tokens
        if (r1.readNext() != r0.readNext()) return false;
    
        // Start the same element, same attribute set
        if (r1.isStartElement()) {
          if (!r0.isStartElement()) return false;
          if (r1.name() != r0.name()) return false;
          if (r1.attrs() != r0.attrs()) return false;
        }
    
        // End the same element
        else if (r1.isEndElement()) {
          if (!r0.isEndElement()) return false;
          if (r1.name() != r0.name()) return false;
        }
    
        // read the same element content
        else if (r1.isCharacters()) {
          if (!r0.isCharacters()) return false;
          if (r1.text() != r0.text()) return false;
        }
      }
    
      // Yes, they behave the same.
      return true;
    }
    

  2. Seen from outside, the stream is continuous on both the input and the output side. Internally it is quantized, like a water bucket in a water playground that fills up and periodically spills.

    Here, the water bucket is a token vector:
      std::vector <TokenInfo> tokens;
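
    TokenInfo itself is not shown here. The following is only a minimal sketch of what such a record might hold, assuming the field names type, name and text that content() uses later, plus a hypothetical attrs map. std::string stands in for fstring/fwstring, and the helper functions are illustrative, not part of the real class.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Token kinds, named after the constants content() compares against.
enum TokenType { NO_TOKEN, START_ELEMENT, END_ELEMENT, CHARACTERS };

// Hypothetical stand-in for TokenInfo: a self-contained copy of one token,
// so it stays valid after the underlying reader has moved on.
struct TokenInfo {
  TokenType type;
  std::string name;                          // element name, empty for text
  std::string text;                          // character data, empty for tags
  std::map<std::string, std::string> attrs;  // attributes of a start element
};

// Illustrative helpers for the three token kinds used in this sketch.
TokenInfo startTag(const std::string& name)
{
  TokenInfo t = TokenInfo();
  t.type = START_ELEMENT;
  t.name = name;
  return t;
}

TokenInfo endTag(const std::string& name)
{
  TokenInfo t = TokenInfo();
  t.type = END_ELEMENT;
  t.name = name;
  return t;
}

TokenInfo characters(const std::string& text)
{
  TokenInfo t = TokenInfo();
  t.type = CHARACTERS;
  t.text = text;
  return t;
}

// The bucket for <p id="1"><t>Hi</t></p>, filled by hand.
std::vector<TokenInfo> sampleBucket()
{
  std::vector<TokenInfo> tokens;
  TokenInfo p = startTag("p");
  p.attrs["id"] = "1";
  tokens.push_back(p);
  tokens.push_back(startTag("t"));
  tokens.push_back(characters("Hi"));
  tokens.push_back(endTag("t"));
  tokens.push_back(endTag("p"));
  assert(tokens.front().type == START_ELEMENT);  // a bucket opens with a start tag
  return tokens;
}
```

    Storing copies rather than pointers into the reader is what lets the bucket outlive the reader's current position.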
    

  3. load

    As the normal stream flows, check on every token whether to start loading:
    enum TokenType StreamCacher::readNext()
    {
    ...
      enum TokenType type = StreamReader::readNext();
      if (isStartElement()) {
        if (name() == cacheTag) {
          return load();
        }
      }
    ...
    }
    
    Loading pushes tokens into a vector.
    enum TokenType StreamCacher::load()
    {
      int32 count = 0; // nesting depth of open cacheTag elements
      while (!this->atEnd()) {
        if (isEndElement()) {
          if (name() == cacheTag) {
            --count;
            if (!count) {
              // The matching end tag closes the bucket.
              tokens.push_back(TokenInfo(*token)); // store the last token
              break;
            }
          }
        }
        else if (isStartElement(cacheTag)) ++count; // outermost or nested cacheTag
        tokens.push_back(TokenInfo(*token));
        StreamReader::readNext();
      }
      token0 = token; // back up
      idx = 0;
      return next();
    }
    
  4. next

    When the bucket is full, spill it: each call to next() hands out one cached token.
    enum TokenType StreamCacher::next()
    {
      token = &tokens[idx]; ++ idx;
      enum TokenType type = token->type;
      if (idx == tokens.size()) {
        tokens.clear();
        token = token0;
      }
      return type;
    }
    
  5. water_bucket_playground_spill_fun

    enum TokenType StreamCacher::readNext()
    {
      if (idx < tokens.size()) {
        return next();
      }
      enum TokenType type = StreamReader::readNext();
      if (isStartElement()) {
        if (name() == cacheTag) {
          return load();
        }
      }
      return type;
    }
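
    The fill-and-spill cycle can be tried out without a real XML parser. The sketch below is a miniature under stated assumptions, not the actual implementation: MiniReader, MiniCacher, Token and bucketSize() are hypothetical names, and the "stream" is just a prepared token list. Unlike the real readNext(), the miniature returns the token itself instead of only its type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum TokenType { NO_TOKEN, START_ELEMENT, END_ELEMENT, CHARACTERS };

struct Token {        // hypothetical, simplified TokenInfo
  TokenType type;
  std::string name;   // element name, empty for CHARACTERS
};

// Hypothetical stand-in for StreamReader: walks a prepared token list.
class MiniReader {
public:
  explicit MiniReader(const std::vector<Token>& src) : src_(src), pos_(0) {}
  virtual ~MiniReader() {}
  virtual bool atEnd() const { return pos_ >= src_.size(); }
  virtual Token readNext()
  {
    if (pos_ >= src_.size()) return Token{NO_TOKEN, ""};
    return src_[pos_++];
  }
private:
  std::vector<Token> src_;
  std::size_t pos_;
};

// Same interface, but everything between <cacheTag> and its matching
// </cacheTag> is read ahead into a bucket and then handed out one by one.
class MiniCacher : public MiniReader {
public:
  MiniCacher(const std::vector<Token>& src, const std::string& cacheTag)
    : MiniReader(src), cacheTag_(cacheTag), idx_(0) {}

  bool atEnd() const override
  {
    return idx_ >= bucket_.size() && MiniReader::atEnd();
  }

  Token readNext() override
  {
    if (idx_ < bucket_.size()) return spill();   // bucket not empty yet
    Token t = MiniReader::readNext();            // normal streaming
    if (t.type == START_ELEMENT && t.name == cacheTag_) {
      fill(t);
      return spill();
    }
    return t;
  }

  // Number of tokens currently held in the bucket.
  std::size_t bucketSize() const { return bucket_.size(); }

private:
  // Read ahead up to and including the matching end tag.
  void fill(Token t)
  {
    bucket_.clear();
    idx_ = 0;
    int depth = 0;   // nesting depth of open cacheTag elements
    for (;;) {
      if (t.name == cacheTag_) {
        if (t.type == START_ELEMENT) ++depth;
        else if (t.type == END_ELEMENT) --depth;
      }
      bucket_.push_back(t);
      if (depth == 0 || MiniReader::atEnd()) break;
      t = MiniReader::readNext();
    }
  }

  // Hand out one cached token; empty the bucket after the last one.
  Token spill()
  {
    assert(idx_ < bucket_.size());
    Token t = bucket_[idx_++];
    if (idx_ == bucket_.size()) {
      bucket_.clear();
      idx_ = 0;
    }
    return t;
  }

  std::string cacheTag_;
  std::vector<Token> bucket_;
  std::size_t idx_;
};
```

    Read side by side with a MiniReader over the same list, a MiniCacher returns the same token sequence. The only visible difference is that bucketSize() is non-zero from the opening cacheTag until just before its closing tag is returned.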
    
  6. Make fun

    Would you be surprised to get the content of a paragraph at the moment you have just read its opening p (paragraph) tag?

    Normally, when you finish reading a paragraph, you know its content. Here is a paragraph reader:
    fwstring readParagraph(StreamReader *r)
    {
      if (!r->isStartElement("p")) return L"";
      fwstring text;
      while (!r->atEnd()) {
        r->readNext();
        if (r->isEndElement("p")) break;
        if (r->isStartElement("t")) {
          text += r->readContent();
        }
      }
      return text;
    }
    
    With a StreamCacher, you know the whole content as soon as you start reading the paragraph:
      while (!r.atEnd()) {
        r.readNext();
        if (r.isStartElement("p")) {
          fwstring content = r.content();
        }
      }
    
    You stay aware of the whole content while reading the paragraph, and suddenly "forget" everything at its end, when you meet the closing p tag:
      while (!r.atEnd()) {
        r.readNext();
        if (r.isEndElement("p")) {
          fwstring nothing = r.content(); // too late to know
        }
      }
    
    The secret is the "water bucket", which quantizes the continuous stream.
    fwstring StreamCacher::content()
    {
      fwstring text;
      std::vector <TokenInfo>::const_iterator i,
        b = tokens.begin(),
        e = tokens.end();
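      for (i = b; i != e; ++i) {
        // Take text only when it directly follows an opening t tag.
        // The i != b check keeps (i-1) inside the vector.
        if (i != b && i->type == CHARACTERS) {
          if ( (i-1)->type == START_ELEMENT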
            && (i-1)->name == "t") {
            text += i->text;
          }
        }
      }
      return text;
    }
    

  7. Example

    Input XML file:

    Expected output:
    Main Street has 5 homes.
            1 out of 5 is George Washington.
            2 out of 5 is John Adams.
            3 out of 5 is Thomas Jefferson.
            4 out of 5 is James Madison.
            5 out of 5 is James Monroe.
    Liberty Street has 2 homes.
            1 out of 2 is George W. Bush.
            2 out of 2 is Barack Obama.
    

    Code:
    bool streamCacher_read()
    {
      fstring fileName = "c:/yanglingfa/xml/concord.xml";
      StreamCacher r;
      if (!r.readFile(fileName)) return false;
      r.setCacheTag("street"); // Specify cached tag name
    
      StreamWriter * os = new StreamConsole;
      EndOfLine endl;
    
      fwstring streetName;
      uint32 numberOfHome = 0;
      uint32 count = 0;
      while (!r.atEnd()) {
        r.readNext();
        if (r.isStartElement()) {
          if (r.name() == "street") {
            streetName = r.attrs()[L"name"];
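            // Copy-construct a new StreamCacher and read it through the
            // StreamReader interface. A named object is used because taking
            // the address of a temporary is not standard C++.
            StreamCacher cached(r);
            numberOfHome = homeCount(&cached); // count in advance !!!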
            *os << streetName << " has " << numberOfHome << " homes." << endl;
            count = 0;
          }
          else if (r.name() == "home") {
            ++ count; // count in stream
            *os << "\t" << count << " out of " << numberOfHome 
                << " is "<<  r.attrs()[L"name"] << "." << endl;
          }
        }
      }
    
      delete os;
      return true;
    }
    
    Here homeCount still reads the token cache through the plain StreamReader interface:
    uint32 homeCount(StreamReader * r)
    {
      uint32 count = 0;
      while (!r->atEnd()) {
        r->readNext();
        if (r->isStartElement("home")) ++ count;
      }
      return count;
    }
    

    Behind the scenes, the storage holds 22 tokens.

    One can reset to any token and read from there:
        StreamCacher r1(r);
        size_t size = r1.size(); // 22
        bool ok = r1.reset(11); // reset to a place you want
        fstring name = r1.name(); // apartment
        bool isStartApt = r1.isStartElement("apartment"); // true
        r1.readNext();
        bool isCharacter = r1.isCharacters(); // true
    

        r1.reset();  // reset to idx = 0 as default
        fstring name1 = r1.name(); // "street"
        size_t count1 = 1;
        while (!r1.atEnd()) {
          r1.readNext();
          ++ count1;
        }
        // It should end up at 22
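
    How size() and reset() are implemented is not shown. The behaviour used above can be sketched over a plain token vector, assuming that reset(i) makes token i the current one and that atEnd() becomes true on the last token. TokenCursor, Token and current() are hypothetical names for this illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum TokenType { START_ELEMENT, END_ELEMENT, CHARACTERS };

struct Token {        // hypothetical, simplified TokenInfo
  TokenType type;
  std::string name;   // element name, empty for CHARACTERS
};

// A read position over a snapshot of cached tokens.
class TokenCursor {
public:
  explicit TokenCursor(const std::vector<Token>& tokens)
    : tokens_(tokens), idx_(0) {}

  std::size_t size() const { return tokens_.size(); }

  // Make token i the current one. Returns false, and leaves the position
  // unchanged, if i is out of range.
  bool reset(std::size_t i = 0)
  {
    if (i >= tokens_.size()) return false;
    idx_ = i;
    return true;
  }

  const Token& current() const
  {
    assert(idx_ < tokens_.size());
    return tokens_[idx_];
  }

  // True once the current token is the last one (or there are none).
  bool atEnd() const { return idx_ + 1 >= tokens_.size(); }

  // Step to the next token. Returns false if already at the end.
  bool readNext()
  {
    if (atEnd()) return false;
    ++idx_;
    return true;
  }

private:
  std::vector<Token> tokens_;
  std::size_t idx_;
};
```

    With these semantics, counting from 1 after reset() and stepping until atEnd() yields exactly size().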
    

  8. Tree?

    If you prefer to work on a tree structure rather than on the cached tokens as a vector or through the stream reader interface, you can write:
      TokenNode * st = StreamCacher(r).tree();
    
    Then you see a tree with only 7 start-element tokens. The tree also holds text nodes, if there are any, but the tokens for indentation characters and end elements do not appear in it. That is why all tokens are collected in a vector rather than stored in a tree; the tree is used for representation purposes only.

    As you can see, this tree has rootName:
      fstring rootName = st->info.name; // "street"
    
    The size of a tree node is the number of children it has. Here, st has 4 children, and its 4th child has 2 children.
      size_t children = st->size(); // 4
      size_t grandChildren = st->child(3)->size(); // 2
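
    How tree() builds its result is not shown. One way such a tree could be derived from the token vector is sketched below, assuming that start elements and non-blank text become nodes, end elements only close the current node, and whitespace-only text is skipped. Node, Token, isBlank and buildTree are hypothetical names, and ownership via std::unique_ptr is a choice made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum TokenType { START_ELEMENT, END_ELEMENT, CHARACTERS };

struct Token {        // hypothetical, simplified TokenInfo
  TokenType type;
  std::string name;   // element name, empty for CHARACTERS
  std::string text;   // character data, empty for tags
};

struct Node {         // hypothetical counterpart of TokenNode
  Token info;
  std::vector<std::unique_ptr<Node>> children;
  std::size_t size() const { return children.size(); }
  const Node* child(std::size_t i) const { return children.at(i).get(); }
};

// True for text that consists only of spaces, tabs and line breaks.
static bool isBlank(const std::string& s)
{
  return s.find_first_not_of(" \t\r\n") == std::string::npos;
}

// Build a tree from a token run that begins with the root's opening tag.
std::unique_ptr<Node> buildTree(const std::vector<Token>& tokens)
{
  std::unique_ptr<Node> root;
  std::vector<Node*> open;   // stack of elements that are still open
  for (const Token& t : tokens) {
    if (t.type == END_ELEMENT) {           // closes the innermost element
      if (!open.empty()) open.pop_back();
      continue;
    }
    if (t.type == CHARACTERS && isBlank(t.text)) continue;  // indentation

    std::unique_ptr<Node> node(new Node);
    node->info = t;
    Node* raw = node.get();
    if (open.empty()) {
      if (root) break;                     // ignore anything after the root
      root = std::move(node);
    } else {
      open.back()->children.push_back(std::move(node));
    }
    if (t.type == START_ELEMENT) open.push_back(raw);
  }
  return root;
}
```

    Because end elements and indentation are dropped on the way in, the original token sequence cannot be recovered from the tree. This matches the choice above of keeping the vector as the storage.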
    

