Published on 08/08/2024 16:00 by Stephen Horne
JSON and the Golden Fleece
First of all, sorry for the title. It’s not even a pun, really, having no meaning besides the fact that JSON sounds like Jason, and the only other Jason I thought of immediately was the, also mythical, Jason Voorhees. And now my mind is linking both Jasons (you could say they’re doubly-linked, even) yet again because Voorhees is a Dutch surname, and the Dutch had a colonial empire that spread naut-ically, and the mythical Argo-nauts were sailors. Further, it was in a body of water that the slasher film villain met his demise before rising from the dead to exact vengeance on… whom exactly? Don’t think I can’t keep going.
Today’s code challenge is to create a JSON parsing program. Interesting, because
- You don’t know the structure of the object before you parse it
- You don’t know what the type/types is/are
- You should not accept invalid types or syntax
- Should you start parsing, and THEN deliver a verdict on the validity of the input, or do a first pass of the byte stream to validate it first?
- The data can be either very small or very large
- Once parsed, the parsed data should be re-serializable
- Could just hold a pointer to the original data, if it’s available for the life of the program, though that doesn’t help when all you want is a subset, so screw that
Let me take a moment to thank cppreference.com because I’d be pretty lost without that wonderful website trying to deal with this huge, often unfriendly language.
Anyway, I’m starting with this, namespaced under JSON:
class Node; // forward declaration, since the aliases below refer to Node

enum class NodeType {
    ARRAY_TYPE,
    BOOLEAN_TYPE,
    NULL_TYPE,
    NUMBER_TYPE,
    OBJECT_TYPE,
    STRING_TYPE
};

using Array = std::vector<std::unique_ptr<Node>>;
using Boolean = bool;
using Null = std::nullptr_t;
using Number = double;
using Object = std::unordered_map<std::string, std::unique_ptr<Node>>;
using String = std::string;
using Value = std::variant<
    Array,
    Boolean,
    Null,
    Number,
    Object,
    String
>;

class Node {
    NodeType type;
    Value value;
public:
    // Value is move-only (Array and Object hold unique_ptrs), so move it in
    Node(NodeType type, Value value) : type(type), value(std::move(value)) {}
};
My reasoning here is:
- I do not want to rely on reflection or other costly operations to determine the type of a given node.
- I don’t care about the order of the original JSON. Ordering is not part of the spec, but strictly incidental because data is inherently serial. The underlying structure of std::unordered_map is a hash map, not a red-black tree as with std::map, and those sorting operations aren’t free; I don’t need them.
- All numbers in JSON are double-precision floating-point numbers, so there’s a 1-to-1 mapping with C++’s double type.
- I know that null in JSON does not have the same meaning as a nullptr in C/C++, but for now it’s what I’ll use until I find something more suitable.
- Value objects need to store any value that’s legal JSON, and the way they’re serialized or operated on later will be determined by the enum value type, so reflection isn’t necessary to know how to treat them later on. Enumerations are cheap, AFAIK, compared to asserting on types at runtime. The code will also be cleaner and nicer to read.
- I aliased the types so their naming matches the JSON spec type names, and so the JSON types are all namespaced. It makes for, I think, nicer-looking and therefore easier-to-read code.
These assumptions may be challenged, which is fine. Planning is its own skill, and one that is refined by being wrong, admitting it, and remembering the lesson next time. I’d be concerned if I were consistently 100% correct about all my assumptions. Either I’m dreaming, or my head is too far up my posterior to even recognize my own mistakes. Yikes. That said, sometimes I’m right and that’s that, but I never immediately trust it.
Updates to come as I continue my implementation. Thanks for reading this.
UPDATE [20240812]
Oof. This is proving to be quite difficult. I’ve had a few false starts. A lot of it is just unfamiliarity with the language, and long, hard-to-read complaints from the compiler (gcc) when I introduce copy constructors to the polymorphic Value class. I think it has to do with the std::variant instance and the non-copyable nature of some of its members. I realized that I had painted myself into a corner, so I started over with a simple little test program:
#define COLOR_RED "\033[1;31m"
#define COLOR_GREEN "\033[1;32m"
#define RESET_COLOR "\033[0m"
#define TEST(description, func)\
    if (!func()) {\
        std::cout << COLOR_RED << "FAIL: " << description << RESET_COLOR << std::endl;\
        return 1;\
    }

int main() {
    TEST("empty payload is not valid JSON", []() {
        std::istringstream ss("");
        auto json = JSON::JSON::parse(ss);
        return !json.isValid();
    })
    std::cout << COLOR_GREEN << "All tests passed!" << RESET_COLOR << std::endl;
    return 0;
}
This way, I’m not over-implementing and getting stuck on a bunch of compiler errors before a single feature is completed. I had a bunch of tests for a bunch of different values, but found that, again, I was putting the cart too far before the horse. Meanwhile, I realized that I needed to have some kind of context around what the prior significant character or group of characters was at any given point to assert the validity of the payload:
enum class Lexeme {
    Colon,
    Comma,
    DecimalPoint,
    Digit,
    DoubleQuote,
    LeftBrace,
    LeftSquareBracket,
    None, // signaling the start of the parsing process
    Null,
    RightBrace,
    RightSquareBracket
};
The algorithm itself is proving to be quite challenging. I think that, perhaps, I am trying to optimize prematurely (wasn’t there a saying about premature optimization being the root of all evil?). I’ll get a test to pass in whatever way “works,” then refactor as I get the next test to pass. I’m going to stop writing tests beyond the next one to assert on the next bit of behavior I want. This is something I believe I have struggled with throughout my career: discipline.
What I mean to say is that I typically just go ahead and start writing a bunch of code until I get nervous that it might not be working the way I think it is, then start testing it. Most of the time, this approach works. I can implement a pretty complex feature sometimes without running its code once, what I think of in my head as a “hole in one.” This, I suppose, “talent” of mine is also a weakness as it means that I have to be extra intentional about breaking more complex problems up into pieces.
I don’t just struggle with this in software, but in music, too. I know it’s sort of a humblebrag, but being talented at something can sometimes lead to problems, especially when you’re self-taught as I am in both software and in music. I started playing the guitar and piano in my early teens, and by the time I had turned 16 I was playing at an advanced level, receiving praise from people decades older than me for my ability and the “maturity” of my playing. I was not a diligent practicer, just having a natural ability. This proved to be problematic when I would encounter a piece that was too far beyond my skill level to just “figure it out,” as I had not cultivated the discipline required to break problems up into pieces. I could punch well above my weight, but that did not mean I had cultivated true “maturity” in my practice routine. Aaaaaaand here I am again.
This project, in a language I am not an expert in and solving a set of problems I have not yet solved (I’ve never written a lexer before, for example), is a good opportunity for me to slow down and exercise discipline in breaking things up, carefully iterating and using TDD to make steady, solid progress. The other coding challenges I’ve completed in C++ so far have been familiar enough that it was just an issue of finding the “C++ equivalent of x or y” in the standard library, but now I’m going to have to really exercise self-control and take my time.
More to come as progress is made, and lessons are learned. До свидания, друзья мои!
UPDATE [20240813]
Implementation is going smoothly. Wrote a little test program. Here’s a little chunk of it:
// some valid examples
DESCRIBE("arrays",
    TEST("singleton array with string literal '[\"\"]' is valid JSON", []() {
        std::istringstream ss("[\"\"]");
        auto json = JSON::JSON::parse(ss);
        return json.isValid();
    })
    TEST("two-element array with two string literals '[\"\",\"\"]' is valid JSON", []() {
        std::istringstream ss("[\"\",\"\"]");
        auto json = JSON::JSON::parse(ss);
        return json.isValid();
    })
    TEST("empty array is valid JSON", []() {
        std::istringstream ss("[]");
        auto json = JSON::JSON::parse(ss);
        return json.isValid();
    })
    ...
    // and some invalid examples
    TEST("singleton array with decimal-trailing zero '[0.]' is not valid JSON", []() {
        std::istringstream ss("[0.]");
        auto json = JSON::JSON::parse(ss);
        return !json.isValid();
    })
    TEST("singleton array with comma '[,]' is not valid JSON", []() {
        std::istringstream ss("[,]");
        auto json = JSON::JSON::parse(ss);
        return !json.isValid();
    })
I color-coded the test results:
#define DESCRIBE(message, tests)\
    std::cout << "TEST GROUP " << ++nGroup << ": " << message << " " << std::endl;\
    tests;\
    std::cout << COLOR_GREEN << "All tests passed!" << RESET_COLOR << std::endl;\
    nTest = 0;

#define TEST(description, func)\
    nTest++;\
    if (!func()) {\
        std::cout << COLOR_RED << "FAIL: " << "TEST " << nTest << ": " << description << RESET_COLOR << std::endl;\
        return 1;\
    }
This is greatly helping progress. I could have, and indeed have, used an off-the-shelf testing framework like Catch2, but I wanted to see if I could build something myself that was useful enough. So far, so good. I don’t add a test unless it fails, as per TDD, then add just enough code to make the test pass without breaking the others. It’s amazing how well this works, and also how hard it is to not add more code than is necessary to make a new test pass.
The discipline is paying off, giving me a sense of confidence in my implementation, and once I’ve run out of payloads that cause tests to return the wrong result, I’ll move on to the actual parsing implementation. The code is built to have parsing shimmed in without having to pass over the payload an additional time. We’ll see how naive this is. 😂
...
switch (presentByte) {
    case 'n':
        valid = false;
        if (lastLexeme != Lexeme::None && lastLexeme != Lexeme::LeftSquareBracket && lastLexeme != Lexeme::Comma)
            return;
        handleNull(stream, valid);
        break;
    case 'f':
        valid = false;
        handleBoolean(stream, valid, false);
        break;
    case 't':
        valid = false;
        if (lastLexeme != Lexeme::LeftSquareBracket && lastLexeme != Lexeme::None)
            return;
        handleBoolean(stream, valid, true);
        break;
    case '"':
        valid = false;
        if (lastLexeme != Lexeme::None && lastLexeme != Lexeme::LeftSquareBracket && lastLexeme != Lexeme::Comma)
            return;
        lastLexeme = Lexeme::DoubleQuote;
        handleString(stream, valid);
        if (!valid)
            return;
        break;
...
Even now as I look at this excerpt, I have to tell my eye not to twitch due to the naivety of the implementation, but until I write a reasonable test that proves it is naive, I’ll leave it alone.
Thanks for reading this! It’s been fun writing it, but I must get back to work. I just got a take-home coding assignment from a software company based in Berlin, and I have a feeling it’s going to be quite hard. Tschüß, meine Freunde!
UPDATE [20240814]
Those function-like macros were proving to be tedious, despite their purpose being to prevent tedium. I took some time to completely rewrite the automated testing code:
...
static void runValidityTest(ValidityTest& test) {
    bool result{false};
    if (test.isFile) {
        std::ifstream fileStream(test.input);
        auto json = JSON::JSON::parse(fileStream);
        result = json.isValid();
        fileStream.close();
    } else {
        std::stringstream stream(test.input);
        auto json = JSON::JSON::parse(stream);
        result = json.isValid();
    }
    if (result != test.expected) {
        failureCount++;
        std::cout << COLOR_RED << "[FAIL] " << "\"" << test.input << "\" ";
        std::cout << COLOR_RED << "EXPECTED: " << boolToString(test.expected) << "; GOT: " << boolToString(result) << RESET_COLOR << std::endl;
    }
}
And the truncated syntax validity test table:
...
static std::vector<ValidityTest> validityTests{
    // objects
    {"{}", true},
    // arrays
    {"[\"\"]", true},
    {"[\"\", \"\"]", true},
    ...
    {"null", true},
    // files
    {"tests/step2/valid.json", true, true},
    {"tests/step3/valid.json", true, true},
    {"tests/step3/invalid.json", false, true},
    {"tests/step4/valid.json", true, true},
};
std::for_each(validityTests.begin(), validityTests.end(), runValidityTest);
...
More excitingly, I have begun the parsing logic. The tests, so far:
std::for_each(validityTests.begin(), validityTests.end(), runValidityTest);

std::vector<ParsingTest> parsingTests{
    // booleans
    { "true", true },
    // strings
    { R"("cheesedurger")", "cheesedurger" },
    { R"("\"")", "\"" },
    { R"("\\")", "\\" },
    // arrays
    { R"([])", std::vector<JSON::ValueType>{} },
    { R"([false])", std::vector<JSON::ValueType>{ false } },
    // numbers
    { "9", 9.0 },
    { "9.0", 9.0 },
};
std::for_each(parsingTests.begin(), parsingTests.end(), runParsingTest);
runParsingTest is, honestly, monstrous, but I think that’s fine. It does a lot of work to ensure that the std::variant type that defines JSON::ValueType doesn’t melt down when a value that isn’t set in it is accessed, and it contains specific serialization logic for the expected and generated values of each type, for when tests fail and I want more information than FAIL [null], as much fun as that is.
Next is to keep adding test data to the parsing test table, and to the validity test table if I encounter syntax validation bugs along the way. Regrettably, runParsingTest will have to get more and more monstrous as I add more std::holds_alternative<BigSteveT> checks and beef up the serialization logic for errors.
One day, maybe I’ll have written a fully-functional, reasonably fast and memory-efficient JSON parsing library. A boy can dream.
Hasta luego, amigos.
UPDATE [20240815]
...
static void runParsingTest(ParsingTest& test) {
    std::stringstream stream(test.input);
    auto json = JSON::JSON::parse(stream);
    auto value = json.getValue();
    if (std::holds_alternative<bool>(test.expected)) {
        if (!std::holds_alternative<bool>(value)) {
            failureCount++;
            std::cout << COLOR_RED << "[FAIL] " << test.input << " cannot be compared to received value" << RESET_COLOR << std::endl;
            return;
        }
        auto booleanValue = std::get<bool>(value);
        auto booleanExpected = std::get<bool>(test.expected);
        if (booleanValue != booleanExpected) {
            failureCount++;
            std::cout << COLOR_RED << "[FAIL] ";
            std::cout << "EXPECTED: " << boolToString(booleanExpected) << "; GOT: " << boolToString(booleanValue) << RESET_COLOR << std::endl;
        }
    } else if (std::holds_alternative<std::string>(test.expected)) {
        if (!std::holds_alternative<std::string>(value)) {
...
Embarrassing, but I was sloppy with my testing code and ended up with three false positives in the parsing tests. I deleted those. Now, the expected value is checked to see if it contains a type for which testing logic is written, then the value received from the parser is checked for the presence of the same type. Tests fail with helpful messages. If neither the expected value nor the parsing result contains the type:
} else {
    failureCount++;
    std::cout << COLOR_RED << "[FAIL] Testing logic not set up to handle variant for test of JSON value " << test.input << RESET_COLOR << std::endl;
}
Phew. I’ll refactor this function as it’s very repetitive. It’s too easy to copy and paste sometimes.
Added a couple more test table rows. Progress is slower now, especially since I realized I may have to redesign the data structure for storing the C++-equivalent types. A tree seems intuitively right, though I’ll have to have a good idea of how, and what kind of tree.
I also really need to separate concerns.
case 'f':
    valid = false;
    if (!validPrecedingLexemes.at(JSON::Lexeme::Boolean).count(lastLexeme))
        return;
    handleBoolean(stream, valid, false, boolean); // <-- this also has validation signaling logic in it, as that `valid` is passed in by reference
    if (valid) {
        if (lastLexeme == Lexeme::LeftSquareBracket) {
            scratchVector.push_back(boolean);
        } else {
            value = boolean;
        }
        lastLexeme = Lexeme::Boolean;
    }
    break;
The validation logic should be kept away from the parsing logic. I want to wrap some data structure in a class that provides methods that handle constructing the stored, parsed JSON data. Within there, I could (maybe) put in calls to some validator function or something. Then, this main function with the giant switch statement can just look at the stream, call a method to put the character in, and check to see if the method returns false to signal an invalid payload. If so, the calling function can return and the JSON class instance is returned from its caller, marked as invalid. This will be much cleaner, and the separation of concerns will make it a bit easier to reason about and continue iterating because, honestly, this has gotten gross and will get much worse once I start adding support for parsing arrays and objects of arbitrary depth.
Hmm, it would probably be much easier to do a validation and tokenization pass first, then use that data to form a tree or something. I don’t know. This is pretty hard. I’m not giving up, though. More updates to come. I perhaps should start pushing up commits with messages that correspond to blog updates so readers can follow along instead of rebasing everything so the repo looks like it was written in a single, fevered night of coffee-induced insomnia. Eh… we’ll see.
Thanks and bye bye.
UPDATE [20240816]
I’m starting over, again. Each iteration is such a learning experience. I have a good plan now.
void runParsingTest(ParsingTest& test) {
    std::stringstream jsonStream(test.input);
    auto got = JSON::JSON::parse(jsonStream);
    if (test.expected != got) {
        failureCount++;
        std::cout << COLOR_RED << "Failure: expected " << test.expected << "; got " << got << RESET_COLOR << std::endl;
    }
}
The first step here is to get the != and << operators overloaded. The != compares the private member value (which is a std::variant<int> to start) to the other JSON::JSON object’s value member, and decides how to compare the two based on the private member type, which is an enum class instance (just Number, to start). All the complexity of comparing any two values supported by the std::variant is tucked away in the != method’s innards. This is much nicer than pulling those std::variant values out and having to deal with the complexity of testing which values are set in them. That’s all the concern of the JSON::JSON class itself.
The << operator will now use a private std::ostringstream member (rather than creating a new one for each invocation) to stream the underlying std::variant value, its approach depending on the type enum value. It will clear the stream first. I could do it after each use instead, but I’ll save the operation until the next time the stream is needed. I don’t think it will affect anything if the stream still holds the result of the last output, as it’s private.
The structure of the JSON namespace is now:
namespace JSON {
    struct ValueType;
    using ValueVariant = std::variant<int>;
    struct ValueType : ValueVariant {
        using ValueVariant::ValueVariant;
    };
    enum class TypeEnum {
        Number
    };
    class JSON {
        ValueType value;
        TypeEnum type;
    public:
        static JSON parse(std::istream&);
        JSON(const int& v) : value(v), type(TypeEnum::Number) {}
        bool operator!=(const JSON& other) const;
        // stream insertion takes and returns std::ostream&, as a non-member friend
        friend std::ostream& operator<<(std::ostream&, const JSON&);
    };
}
I am starting with just enough code to
- parse a stringstream representing a JSON number and return a JSON::JSON object instance containing a std::variant<int> in its value member, with type = TypeEnum::Number.
- allow comparison in the test to check if the underlying std::variant<int> value is NOT equal to another JSON::JSON’s underlying std::variant<int> value, so a failure message can be printed out.
- create a new JSON::JSON instance with just an int object, automatically setting the appropriate TypeEnum value.
- overload << so that I can stream out a serialized version of the JSON::JSON object’s value member (just an int to start; I will add double and others later) for test failures.
I feel pretty confident about this approach. I don’t want to add it yet, but there will need to be a pointer to a parent JSON::JSON object in the JSON::JSON class. This doesn’t apply to a simple JSON number, but will be needed for members of objects and arrays. I’m thinking of a std::shared_ptr<JSON::JSON> JSON::JSON::getParent() method that will return a smart pointer to the object’s parent, or nullptr. This will allow the static JSON::JSON::parse method’s logic to go “up” to a value’s container after the value has been added so more values can be added, or so that it can continue travelling up the tree as more } and ] characters are encountered in the byte stream, signaling the anticipated completion of whatever container is currently in scope. A nullptr parent signals to the parsing logic that the value currently in scope is at the top of the tree.
I think this tree structure, combined with state like “parsing an object and presently expecting a value after a key” or “encountered a comma and expecting another value because I am forming an array now,” will ensure that the resulting representation is accurate, and that the JSON stream is, to the knowledge of JSON::JSON::parse up to the given byte, valid, and therefore parsing should continue.
Learnings so far:
- My last approach seemed reasonable until I started getting into more complex parsing situations, like nested objects. TDD is good, but tests should not precede a good idea of how things will be implemented from a structural and API point of view.
- The validation logic was way too complex. A state-based validation approach plus an opinionated data structure will, I believe, be fast AND accurate.
- Further, I will not allow invalid JSON::JSON objects to be created. When the state of the parser, the state of the JSON::JSON object, and the incoming byte from the stream do not agree, the parser will throw an error and halt parsing altogether. What good is an invalid JSON::JSON object?
- A tree structure is good, but I need a pointer to a value’s parent at all times. I won’t expose this outside of the JSON::JSON class, as it’s not a useful thing by itself. It’s there for the parser to traverse the tree, and to determine whether it can continue processing the incoming byte given the state of the parser and the currently-pointed-to JSON::JSON’s type enum value (e.g., Oh, you’re a TypeEnum::Array and we’re expecting another value because we just encountered a comma character? Okay, I’ll give you the next valid value I have, then expect either another comma or to close you out and go up to your parent, if it exists).
- Projects like this need careful consideration about data structures, interfaces, and algorithms before code is written. I actually decided to completely start this project over while driving to pick up dinner. I realized that my approach’s complex implementation was a code smell, that not being able to simply return to a value’s container parent was going to make it even worse, and that a sound data structure, combined with the parser’s state about what it expects next given the tree node’s underlying type and its last processed value, would make for a far more elegant and stable solution. It was a eureka moment that I could not have had while sitting in front of the editor, trying to shim in more and more escape clauses to make a test pass.
- Testing code should be simple and easy to read. Having a giant function to probe the std::variant structure for its contained value was a code smell. JSON::JSON objects should be comparable directly with == and !=. All the complexity around that is not the test’s business.
- C++ is pretty unfriendly, but if I have to write a bunch of code that I have a hard time reading later, in any language, it is very likely that my approach is flawed.
This has been fun. I didn’t expect it to be easy. The biggest skill I think I am building here is planning. If I had really thought an algorithm + data structure combination through before even opening my IDE, I could have saved myself so much frustration and time. Maybe I need to buy a white board and some markers.
Thanks for reading this. I hope it was interesting. I know I’m learning a lot.