Published on 08/28/2024 14:00 by Stephen Horne

JSON and the Golden Fleece, Pt 5

Hey look at this commit. What do you see? Yes, it’s a bunch of very ugly text. Those are C++ source files. I’m so sorry. But wait! What this ugly text compiles into is a working JSON parser! I still need to do some more testing, and refactor it because it could be less repetitive and defensive (JSON payloads are a scary landscape when you can only see one byte at a time), and improve efficiency here and there with some more helpful, actionable error messages, but it works!

It works on all six of the JSON data types: Object, Array, String, Null, Number and Boolean. It works when nesting an Array in an Object, or an Object in an Array, and, because of my “clever” algorithm, it keeps context regardless of how deep it is in a payload. It also uses proper C++ ownership and move semantics, saving memory and keeping pointers from going rogue and leaking memory. It uses a polymorphic tree to store all the parsed data. It doesn’t waste parsing buffer space on, well, spaces. It appears to be pretty fast, though I’ll have to benchmark with some big JSON payloads.

Compile time is a little long but apparently GCC’s linker is just slow? I read about using foreward declarations instead of #include statements to improve compile time, but I couldn’t find any real opportunities in my project to do so. Once it’s compiled, though, it runs all the tests in the blink of an eye. Good enough for now.

How It Works

The algorithm uses a loop. That is, a while loop that will go on forever until the stream providing the character bytes reaches EOF. Within this loop, after an EOF check, is a very large switch. Every character of interest has its own case, with many of them falling through for grouped characters like digits and those that make up true, false and null. As these “primitive” characters are encountered, they are added to a buffer until an event happens which signals that it’s time to makes sense of those bytes.

static void collapse(StackType& stack, const Type type) {
    if (type != Type::Array && type != Type::Object)
        throw std::runtime_error("JSON::collapse called with non-container Type enum");

    StackType temp;

    while (!stack.empty() && stack.top().second->getType() != type) {
        temp.push(std::move(stack.top()));
        stack.pop();
    }

    if (stack.empty())
        throw std::runtime_error("did not find beginning of container when collapsing");

    if (type == Type::Object) {
        auto objectPtr = static_cast<ObjectNode*>(stack.top().second.get());

        while (!temp.empty()) {
            auto tempTop = std::move(temp.top());
            if (tempTop.first.empty())
                throw std::runtime_error("empty key encountered while collapsing Object");

            objectPtr->insert(std::move(tempTop.first), std::move(tempTop.second));

            temp.pop();
        }
    } else {
        auto arrayPtr = static_cast<ArrayNode*>(stack.top().second.get());

        while (!temp.empty()) {
            auto tempTop = std::move(temp.top());
            if (!tempTop.first.empty())
                throw std::runtime_error("non-empty key encountered while collapsing Array");

            arrayPtr->insert(std::move(tempTop.second));

            temp.pop();
        }
    }
}

Above is my collapse procedure that is triggered when a } or ] is encountered. More on that in a moment.

The “well, what do we have here?” events are all triggered by encountering specific characters, except for EOF, which is just a bit somewhere in the bowels of a stream object. The characters are:

{ and }: push or pop a pointer to an ObjectNode object on or off the stack, respectively. To wrap up the Object’s contents (after parsing or throwing on what’s in the primitive value buffer and adding it to the stack), we perform what I have named a collapse procedure (included above): take pointers off the stack, putting them in a temporary stack, until you reach a ObjectNode pointer. Throw an error if one is never found, or if any of the pointed-to values are missing a key (should not be possible given how the stack is managed, but I’m keeping it there just in case I’m wrong). When the ObjectNode pointer is found, take each key-value pair off the temporary stack and put them in the Object. Then, we continue parsing, the Object having been properly collapsed.
[ and ]: the same as { and }, but if any of the items on the stack found before the Array pointer is found have a key, we throw. Again, I do not intend for this to be possible as items put on the main stack after an ArrayNode has been placed there specifically do not have a key, and any subsequent items in either an ObjectNode or ArrayNode sub-stack follow the pattern of the item before it. This pattern validates the payload. I don’t know the name for it, but this concept of using a strictly-enforced data structure for storing parsed data as a form of validation of the incoming data itself (e.g. the last item on the stack is an ArrayNode pointer, and we’re encountering a :, so this payload is clearly invalid). I originally began using the JSON tree itself as a validation method but the stack is simpler, and the collapse procedure is fun and seems to work well.
,: If the stack is empty, throw. Commas don’t make sense outside of an Object or Array. If there’s something in the primitive value buffer, try to parse and clear it if it’s valid or throw if it’s not. If there’s a non-empty key with a nullptr value on the stack (we’re in an Object), move a new unique_ptr which owns a new ValueNode containing our freshly-parsed data to the value. If it’s a keyless value (we’re in an Array), push this new unique_ptr to the stack and continue parsing. (needs more testing around edge cases, as I feel this is relying too much on the parsing events following it to validate comma usage…)
:: Since I mentioned it above. Here, we check that the stack has a non-empty key with a nullptr value. If not, we throw. If we’re not expecting a value to match with a key, this character should never be used. “But what about strings and the keys themselves!?” you are likely screaming at the top of your lungs at your local coffeeshop right now. First of all, please stop screaming. You’re making a scene. Second, let me give : its due, for it’s important in making JSON so useful. If we find that we, indeed, are seeking a value, we simply proceed with parsing. Nice and simple.
": Okay. If you were one of the people who blew your top a moment ago (you, and the rest of the shop patrons, know who you are) over : being strictly verboten let me tell you about how strings and keys are parsed (code below). Strings, regardless of their function and as long as they are not in an unexpected place, are collected from the stream the same way: we continue the stream from the first " and stop at the next one (TODO add support for escaped "s), adding each byte along the way (see? there’s no mention in the code of a :. Feel better?). Then, if the stack is not empty and we’re ready for an object key (we have reason to believe, based on the parser’s internal state, that we’re in an object and don’t have any dangling keys), we remove the beginning and ending " characters and put the string on the stack as a key, ready to be paired with its one true lov.. I mean its value. If we’re not ready for a key, we simply leave the string in the buffer and continue parsing. The string is later dealt with when a parsing event occurs (,, }, ], EOF). We keep the " around it so the parsing function knows it’s a string and not just illegal data floating where it doesn’t belong, masquerading as a primitive value.

case '"':
    if (!parsingBuffer.empty())
        throw std::runtime_error("unexpectd double-quote (\")");

    parsingBuffer.push_back('"');
    while (stream.get(byte)) {
        parsingBuffer.push_back(byte);
        if (byte == '"')
            break;
    }

    if (!stack.empty()) {
        if (readyForObjectKey(stack)) {
            parsingBuffer.erase(parsingBuffer.begin());
            parsingBuffer.erase(parsingBuffer.end() - 1);
            stack.push(
                std::make_pair(
                    parsingBuffer,
                    std::move(std::unique_ptr<ValueNodeBase>{}))
            );
            parsingBuffer.clear();
        }
    }