YAML Parsing – Is It Wrong to Parse ‘True’ as a String?

parsing, yaml

Given these lines of YAML:

version: 1.00
y: 1

What does this represent?

According to the YAML spec (I'm not meticulous enough to read the spec like a lawyer), does this necessarily mean that the key str("version") maps to the value float(1.0), and the key bool(true) maps to the value int(1)? Or is it up to the parser to decide whether 1.00 should be a float or a str?

As these two lines suggest, version: 1.00 should really be parsed as a string, not as a number, and y: 1 might actually mean a Y-coordinate rather than a boolean true. This is a common pitfall in YAML config files.

So a co-worker suggested writing a parser that interprets a YAML file based on a user-given data structure (like the Go unmarshal libraries) rather than on the types inferred from the YAML code. I objected because I think this kind of parsing would violate the YAML specification, making the YAML file no longer "real YAML".

An immediate consequence (if I interpret the YAML spec correctly) is this: suppose someone made an interactive GUI editor for YAML files (maybe something like Windows regedit). The editor would parse version: 1.00 in the above YAML code as a float, then emit it back as 1.0, leading to inconsistent data. If my understanding of the YAML spec is correct, the editor is not making incorrect assumptions about the information carried by the YAML code, so the editor isn't doing anything wrong. The user is just following our so-called YAML standard (and hence believes that it is standard YAML, and that the editor is also following standard YAML), so the responsibility would lie with us (not the user, not the editor). Therefore I oppose this custom parser.
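The round trip described above can be sketched in a few lines of Python. This is only an illustration, assuming the editor stores the value as a native float and writes it back using the language's default formatting; the function name naive_round_trip is hypothetical and does not come from any YAML library.

```python
def naive_round_trip(scalar_text: str) -> str:
    """Parse scalar text as a float, then emit it again as text."""
    value = float(scalar_text)  # "1.00" becomes the number 1.0
    return str(value)           # the trailing zero is not preserved


original = "1.00"
emitted = naive_round_trip(original)
print(original, "->", emitted)
```

The number is unchanged, but the text is not, which matters when the value was meant as a version string.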

Am I correct? If I am correct, how should I convince him that he is wrong? If I am incorrect, how would you explain this scenario? (Please convince me, because we're getting a bit heated.)

(DO NOT TURN THIS INTO AN XY QUESTION. We all know that YAML has problems, but to avoid going off-topic, let's assume we can't change to another format.)

(Context: we are working on an open-source project which has users from different levels of the education spectrum)

Best Answer

YAML is an extremely complicated format, and the result of parsing that document may differ between YAML versions and parsers. YAML itself does not prescribe a data model beyond scalars/sequences/mappings/pointers. However:

YAML has the concept of tags that serve as explicit type annotations (!!str 1.00, !MyCustomTag foo), which are sometimes used to discover constructors through reflection (this is where all those YAML security issues come from).

YAML also has a concept of schemas that can be used to derive implicit type information from unquoted strings. These types are usually defined as a regex that must match the value. For convenience, all parsers seem to offer a JSON-ish default schema, sometimes with YAML extensions (like boolean yes/no), and often with language-specific extensions as well. Many parsers allow user-defined schemas. Some YAML parsers give access to the document before it is interpreted under some schema.

The result of these schemas is that while multiple parsers may parse a document identically, they could interpret the data types differently, unless they use the same schema. Schemas are not declared within the document.
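As a rough illustration of how two schemas can disagree about the same plain scalar, here is a minimal Python sketch. The two rule tables are simplified assumptions that only approximate the YAML 1.1 type rules and the YAML 1.2 Core Schema (no null, hex, octal, infinity, or timestamps), and the names YAML_1_1_LIKE, YAML_1_2_CORE_LIKE, and resolve are hypothetical, not taken from any real parser.

```python
import re

# Shared, simplified patterns (assumption: decimal ints and floats only).
INT = re.compile(r"^[-+]?[0-9]+$")
FLOAT = re.compile(r"^[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?$")

# Approximation of YAML 1.1 implicit typing: many spellings of booleans.
YAML_1_1_LIKE = [
    ("bool", re.compile(
        r"^(?:y|Y|yes|Yes|YES|n|N|no|No|NO"
        r"|true|True|TRUE|false|False|FALSE"
        r"|on|On|ON|off|Off|OFF)$")),
    ("int", INT),
    ("float", FLOAT),
]

# Approximation of the YAML 1.2 Core Schema: only true/false are booleans.
YAML_1_2_CORE_LIKE = [
    ("bool", re.compile(r"^(?:true|True|TRUE|false|False|FALSE)$")),
    ("int", INT),
    ("float", FLOAT),
]


def resolve(scalar: str, schema) -> str:
    """Return the name of the first type whose regex matches, else 'str'."""
    for type_name, pattern in schema:
        if pattern.match(scalar):
            return type_name
    return "str"


for text in ["version", "1.00", "y", "1"]:
    print(text, resolve(text, YAML_1_1_LIKE), resolve(text, YAML_1_2_CORE_LIKE))
```

Under these assumed tables, the key y is a boolean in the 1.1-style schema and a string in the 1.2-style one, while 1.00 is a float in both.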

This means that a YAML editor does have to treat types and schemas explicitly. The editor should automatically insert tags or quotes to prevent a value from being interpreted as the wrong type under the given schema. For example, to have the value true be recognized as a string, either quotes or a tag would work: "true", 'true', !!str true. Alternatively, an editor can be type-agnostic, but that can lead to seemingly trivial edits changing the type of a scalar.
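A schema-aware editor could apply a check along these lines before writing a string value. This is a minimal sketch under stated assumptions: the pattern NON_STRING is a simplified stand-in for a YAML 1.1-style schema, the function name emit_string is hypothetical, and the sketch ignores other reasons a string may need quoting (colons, leading or trailing spaces, comment characters, and so on).

```python
import re

# Assumed, simplified set of plain scalars that a YAML 1.1-style schema
# would resolve to something other than a string (bool, null, int, float).
NON_STRING = re.compile(
    r"^(?:y|Y|yes|Yes|YES|n|N|no|No|NO"
    r"|true|True|TRUE|false|False|FALSE"
    r"|on|On|ON|off|Off|OFF"
    r"|~|null|Null|NULL"
    r"|[-+]?[0-9]+"
    r"|[-+]?(?:[0-9]+\.[0-9]*|\.[0-9]+)(?:[eE][-+]?[0-9]+)?)$"
)


def emit_string(value: str) -> str:
    """Return YAML text for a value that must be read back as a string."""
    if value == "" or NON_STRING.match(value):
        # Single-quoted style; a literal single quote is escaped by doubling.
        return "'" + value.replace("'", "''") + "'"
    return value


for text in ["true", "1.00", "y", "hello"]:
    print(text, "->", emit_string(text))
```

Emitting an explicit tag such as !!str instead of quotes would serve the same purpose; quoting is used here only because it is shorter.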

Your coworker is slightly more correct, assuming they are suggesting that you define a custom schema or parse without any YAML-level schema. Due to the proliferation of parsers, there is no clear standard schema: while the YAML 1.2 standard declares a possible Core Schema, it is not mandated, and most parsers default to interpreting documents as under previous YAML versions.
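Parsing without a YAML-level schema and then applying application-defined types might look roughly like the following Python sketch. It assumes the parser has already produced a mapping in which every scalar is still plain text (for example, PyYAML's BaseLoader behaves this way); the names FIELD_TYPES and load_typed are hypothetical, and the field types are the ones the question suggests the application actually wants.

```python
# Raw mapping as a schema-less parse would return it: every scalar is text.
raw = {"version": "1.00", "y": "1"}

# Application-defined types, declared once instead of being inferred
# from how each value happens to be spelled in the file.
FIELD_TYPES = {
    "version": str,  # keep "1.00" exactly as written
    "y": int,        # a Y-coordinate, not a boolean
}


def load_typed(raw_mapping: dict, field_types: dict) -> dict:
    """Convert each raw scalar with the converter declared for its key."""
    return {key: convert(raw_mapping[key]) for key, convert in field_types.items()}


config = load_typed(raw, FIELD_TYPES)
print(config)
```

Here the type decision lives in the application rather than in the parser's default schema, so the trailing zero in version survives and y stays a key with an integer value.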
