I need a textual, human-readable format that is reasonably compact and version-control friendly, to serialize a persistent memory heap. My Bismon system (GPLv3) has such a format: it is textual, human-readable, git-friendly, and occasionally editable under emacs, but specific to Bismon; it is usually loaded and dumped by the bismon program. That format is documented in the Bismon technical draft report (please skip the first few pages for H2020 bureaucracy), chapter §2 Data and its persistence in Bismon. For an example of a file using that format, look into Bismon's store1.bmon (and other store*.bmon files).
I am considering that such a format might better be JSON-like (but I am not sure), just because many developers are familiar with JSON.
The JSON format requires object keys to be quoted strings, e.g. { "x":1, "y":2 }.
I am thinking of an application (maybe RefPerSys, which conceptually could become a Bismon done right) where a JSON notation is very useful (and where a human-readable textual file format is essential), but where we deal only with JSON objects whose keys are always C-like identifiers (starting with a Latin letter, and containing only letters, digits, and underscores). However, that application may need to parse perhaps a million such objects; parsing performance matters a little, and, more significantly, file space will matter a lot (since for {x:1,y:2} only 9 bytes are needed, but {"x":1,"y":2} requires 13 bytes, i.e. about 44% more space). My exact goal is any textual, human-readable, quickly and easily machine-parsable, tree-structured, compact, version-controllable (i.e. git-friendly) format. Most of the time it is dumped and loaded by the same application. Occasionally, I may need to glance into it with some editor, and perhaps even to change a small bit of it with that editor. I do not imagine needing a generic JSON transformer or processor like jq.
But my feeling is that, when the keys are C-like identifiers (and different from the three JSON keywords true, false, null), the quotes could be avoided, as for example in {x:1, y:2}. I also understand that some JavaScript implementations might be able to parse that.
I am obviously guessing that parsing {x:1,y:2} is faster than parsing { "x":1, "y":2 } or even {"x":1,"y":2} (simply because the textual representation is slightly shorter), especially when we deal with millions of such JSON objects.
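The space difference is easy to measure rather than guess. Below is a minimal sketch in Python (chosen only for brevity; the real application would be C or C++). The function name dump_bare_keys and the identifier pattern are assumptions made for this illustration, not part of Bismon or of any JSON library.

```python
import json
import re

# Assumed key syntax: a C-like identifier (letters, digits, underscores).
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")
JSON_KEYWORDS = ("true", "false", "null")

def dump_bare_keys(value):
    """Serialize like compact JSON, but leave identifier-like keys unquoted."""
    if isinstance(value, dict):
        items = []
        for key, val in value.items():
            if not IDENT.match(key) or key in JSON_KEYWORDS:
                raise ValueError(f"key {key!r} cannot be written without quotes")
            items.append(key + ":" + dump_bare_keys(val))
        return "{" + ",".join(items) + "}"
    if isinstance(value, list):
        return "[" + ",".join(dump_bare_keys(v) for v in value) + "]"
    # Scalars (numbers, strings, true/false/null) keep their JSON spelling.
    return json.dumps(value)

sample = {"x": 1, "y": 2}
strict = json.dumps(sample, separators=(",", ":"))  # {"x":1,"y":2}
bare = dump_bare_keys(sample)                       # {x:1,y:2}
overhead = (len(strict) - len(bare)) / len(bare)

print(len(strict), len(bare))  # 13 9
```

On this tiny object the quotes cost 4 bytes out of 9, so the strict form is about 44% larger. On objects with longer values the relative overhead is smaller, so it is worth running the same measurement on a realistic dump.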
In a Bismon-like or RefPerSys-like system, a possible example could be:
{ oid: _7T9OwSFlgov_0wVJaK1eZbn,
name: word,
mtime: 1502296590.98,
class: _7T9OwSFlgov_0wVJaK1eZbn,
attrs: [ { at: _01h86SAfOfg_1q2oMegGRwW, va: "for words" } ]
}
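Note that this example goes further than unquoted keys: values such as _7T9OwSFlgov_0wVJaK1eZbn and word are bare identifiers too. To give an idea of how small a dedicated parser can be, here is a rough recursive-descent sketch in Python. Everything named here (Symbol, tokenize, parse, and so on) is hypothetical and written for this illustration; it is not Bismon's loader, and treating bare identifier values as symbols is my assumption about the intended meaning.

```python
import json
import re

# One token: punctuation, a JSON string, a number, or a bare identifier.
TOKEN = re.compile(r"""\s*(?:
      (?P<punct>[{}\[\]:,])
    | (?P<string>"(?:[^"\\]|\\.)*")
    | (?P<number>-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)
    | (?P<ident>[A-Za-z_][A-Za-z0-9_]*)
    )""", re.VERBOSE)

KEYWORDS = {"true": True, "false": False, "null": None}

class Symbol(str):
    """A bare identifier used as a value (hypothetical wrapper, e.g. an object id)."""

def tokenize(text):
    tokens, pos = [], 0
    while True:
        match = TOKEN.match(text, pos)
        if match is None:
            if text[pos:].strip() == "":
                return tokens
            raise ValueError(f"unexpected character at offset {pos}")
        tokens.append((match.lastgroup, match.group(match.lastgroup)))
        pos = match.end()

def peek(tokens, i):
    if i >= len(tokens):
        raise ValueError("unexpected end of input")
    return tokens[i]

def parse_value(tokens, i):
    kind, lexeme = peek(tokens, i)
    if kind == "string":
        return json.loads(lexeme), i + 1   # reuse JSON string escapes
    if kind == "number":
        is_float = any(c in lexeme for c in ".eE")
        return (float(lexeme) if is_float else int(lexeme)), i + 1
    if kind == "ident":
        value = KEYWORDS[lexeme] if lexeme in KEYWORDS else Symbol(lexeme)
        return value, i + 1
    if lexeme == "{":
        return parse_object(tokens, i + 1)
    if lexeme == "[":
        return parse_array(tokens, i + 1)
    raise ValueError(f"unexpected token {lexeme!r}")

def parse_object(tokens, i):
    result = {}
    if peek(tokens, i) == ("punct", "}"):
        return result, i + 1
    while True:
        kind, key = peek(tokens, i)
        if kind != "ident" or key in KEYWORDS:
            raise ValueError(f"expected an identifier key, got {key!r}")
        if peek(tokens, i + 1) != ("punct", ":"):
            raise ValueError(f"expected ':' after key {key!r}")
        value, i = parse_value(tokens, i + 2)
        result[key] = value
        token = peek(tokens, i)
        if token == ("punct", "}"):
            return result, i + 1
        if token != ("punct", ","):
            raise ValueError("expected ',' or '}' in object")
        i += 1

def parse_array(tokens, i):
    result = []
    if peek(tokens, i) == ("punct", "]"):
        return result, i + 1
    while True:
        value, i = parse_value(tokens, i)
        result.append(value)
        token = peek(tokens, i)
        if token == ("punct", "]"):
            return result, i + 1
        if token != ("punct", ","):
            raise ValueError("expected ',' or ']' in array")
        i += 1

def parse(text):
    tokens = tokenize(text)
    value, i = parse_value(tokens, 0)
    if i != len(tokens):
        raise ValueError("trailing data after the top-level value")
    return value

sample = """{ oid: _7T9OwSFlgov_0wVJaK1eZbn,
name: word,
mtime: 1502296590.98,
class: _7T9OwSFlgov_0wVJaK1eZbn,
attrs: [ { at: _01h86SAfOfg_1q2oMegGRwW, va: "for words" } ]
}"""
obj = parse(sample)
```

Here obj["name"] comes back as a Symbol, while obj["attrs"][0]["va"] is an ordinary string, so the distinction between identifiers and quoted text survives a round trip. A production version in C or C++ would work on bytes rather than on a token list, but the grammar would be the same size.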
(currently, in commit ff19f15ecd2f647d42 of Bismon, the equivalent is in lines 1011 and following of store1.bmon; the | there delimits comments, and comments like |=word| could be removed, since they are skipped at parsing; the comments in these dumped and loaded *.bmon files will be removed once Bismon is stable enough)
In a few years, I could have many millions of such JSON objects. The bismon program is a server, started every morning (it then loads its persistent state in textual format) and ended every evening (it then dumps its persistent state in textual format). So taking one or a few minutes to load, and one or a few minutes to dump, a large persistent state is definitely acceptable. But the disk size of that textual persistent state, once committed to git, is more of a concern (since both gitlab and github are unhappy with large textual files).
Since humans will very rarely look into the textual persistent store (as rarely as a compiler writer looks into generated assembler, or as rarely as the sqlite team looks into huge *.sql files), I value compactness and git-friendliness over readability. So I could even consider something as compact as:
{oid:_7T9OwSFlgov_0wVJaK1eZbn,nam:word,mti:1502296590.98,
cla:_7T9OwSFlgov_0wVJaK1eZbn,
att:[{a:_01h86SAfOfg_1q2oMegGRwW,v:"for words"}]}
or even the same in a single line. However, being occasionally able to git diff is valuable.
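One layout that keeps both properties is one object per line, dumped in a stable order, so that changing one object changes one line of the diff. The sketch below only illustrates that idea; dump_lines, the sample data, and the choice to sort by oid are assumptions, and for brevity all values are taken to be bare identifiers or numbers (no quoted strings, no nesting).

```python
import difflib

def dump_lines(objects):
    """One compact object per line, sorted by oid, keys in sorted order."""
    lines = []
    for obj in sorted(objects, key=lambda o: o["oid"]):
        body = ",".join(f"{key}:{obj[key]}" for key in sorted(obj))
        lines.append("{" + body + "}")
    return lines

# Two states of a tiny store; only the mti field of object _b2 differs,
# and the in-memory order is different on purpose.
before = [{"oid": "_b2", "nam": "word", "mti": 1},
          {"oid": "_a1", "nam": "noun", "mti": 2}]
after = [{"oid": "_a1", "nam": "noun", "mti": 2},
         {"oid": "_b2", "nam": "word", "mti": 5}]

changed = [line
           for line in difflib.unified_diff(dump_lines(before),
                                            dump_lines(after), lineterm="")
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]

for line in changed:
    print(line)
```

Because the dump order does not depend on the in-memory order, the diff contains only the removed and the added version of the one object that changed. Putting everything on a single line would instead make every commit rewrite the whole file from git's point of view.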
In other words, the JSON model is very nice to me. But its concrete syntax less so. Patching most JSON libraries to such a simplified syntax is very probably trivial work.
This brings three questions:
- What is the exact name of this common variant of JSON in which keys are always C-like identifiers? (It seems that the YAML specification still suggests it to be JSON, but it is not exactly JSON, only something very close to it.) While that format is not exactly JSON, it is very JSON-like (and the conversion to exact JSON is trivial, assuming a parsing library exists for it).
- What are the open source C or C++ libraries dealing with that format (for Linux/x86-64)? I am guessing that adapting the source code of JSON parsing libraries to that special case is trivial. But I really want to avoid forking one.
- Can recent Web browsers (Firefox or Chrome) efficiently parse {x:1,y:2} as JSON? I tend to believe so (since that notation is exactly compatible with JavaScript).
This GIT and YAML answer could be relevant.
And I just discovered HJSON which might be what I want.
PS. I can avoid any set of given C keywords or identifiers in the keys, if I have such a list of forbidden or reserved names. In particular, I will avoid every JavaScript or C++ keyword (like for or auto or while) for key names. The only platform I care about is Linux (currently x86-64).
PPS. Another application where a human-readable textual file format is essential is my Bismon project (a persistent reflexive monitor for static source code analysis, under GPLv3+ license), and I am explaining why in the Bismon draft report (that is a H2020 draft deliverable, so please skip the first few pages for H2020 bureaucracy). I have chosen in Bismon to have my own human-readable textual format, but that particular choice might have been a big mistake, and I probably should have used some JSON-like one (or even JSON itself), as suggested in this question. The RefPerSys project might become a "Bismon done right" project. And the persistent data of Bismon (e.g. its store2.bmon textual file) is git-version controlled and occasionally hand-edited (but most of the time, loaded and dumped by bismon itself). So, yes, there are cases where a 20% space difference matters for textual data: on both gitlab and github, a textual version-controlled file of 700 Kbytes or of 1.1 Mbytes is presented very differently. In Bismon, the store2.bmon file is already shown only in raw format.
Best Answer
Keys in a JSON dictionary are not quoted strings, they are strings. Strings in JSON start with a quote, continue with escaped or unescaped characters, and end with a quote. You can't have a different JSON. You can define a different exchange format, but it won't be JSON and you are completely on your own.
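A quick way to see this is to feed both spellings to a strict JSON parser. The check below uses Python's standard json module purely as one example of such a parser; the helper name is_strict_json is made up for the illustration.

```python
import json

def is_strict_json(text):
    """Return True if a strict JSON parser accepts the text."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True

print(is_strict_json('{"x":1,"y":2}'))  # True
print(is_strict_json('{x:1,y:2}'))      # False
```

The unquoted form is rejected, which is the behaviour to expect from any conforming JSON parser. A file written as {x:1,y:2} therefore needs its own reader, or a parser for some explicitly different JSON-like format.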