My thoughts on YAML

I've been thinking a lot about YAML recently, as I find myself continuously alternating between loving it and hating it. Here are my thoughts.

Many times over I've wanted to make certain adjustments to YAML and write my own parser. Ultimately, I never end up doing it, because I know it's not a good idea to fragment an ecosystem (even if only for myself) and because it's a large time investment for very little return. So instead, I'm venting my frustrations in the form of a blog post. To verify my position on the matter, or at least make sure I'm not an outlier, I ran a little poll in the Eleventy Discord and asked people for their opinion on YAML. Here are the results (18 people responded, myself included):

Clearly I'm not entirely crazy, given almost half the people responding hate YAML at least sometimes, but I have to admit I was quite surprised to see half of the respondents saying they like YAML.

I think a lot of it comes down to context and personal tendencies. If you already like two-space indents, and you consistently use quoted strings in YAML, then most of YAML's pitfalls are trivialized. I wouldn't necessarily say that's an argument in YAML's favor; rather, it allows one to overlook certain issues in the language.

Quoteless strings

One of YAML's main problems stems from the fact that it allows unquoted strings. This is nice in theory, because typing punctuation is not usually ergonomic, but ultimately it creates many edge cases that we need to be aware of. For example, using a colon (followed by a space) inside a string invalidates the whole thing, and we need to be mindful that some values (even if they look like strings in context) are not parsed as such. A classic example is the Norway problem, where in older versions of YAML, writing [fi, no, se, is] for a list of Scandinavian countries would produce an array of three strings and one boolean; no is not a string, but equivalent to false. This is a case that YAML 1.2 has addressed, but still there are cases where unquoted strings can bite you in the behind:

name: My product: its version hashes
description: The # of hashes must be less than 12
version_hashes:
- 19bdb736
- 96677875
- ce7d86ef

This YAML does not do what we want (even if we fix the error on line 1), and it's all the fault of allowing quoteless strings. You could argue that this is my fault for writing it this way, but I say that's victim blaming; I've been handed a gun that's pointed at my foot and told I can use it however I like. I know it's my responsibility not to shoot myself in the foot, but can I be blamed if I do?
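For what it's worth, here is a sketch of how I'd expect a parser to read the snippet above, and what the defensive version looks like. I'm assuming the YAML 1.2 core schema here; exact behavior can vary between parsers. Line 1 is a syntax error because of the second colon followed by a space, the description is cut off at "The" because a space followed by # starts a comment, and the middle hash is read as an integer because it happens to contain only digits. Quoting everything sidesteps all three:

```yaml
# Every string is quoted, so nothing is left for the parser to guess.
name: "My product: its version hashes"
description: "The # of hashes must be less than 12"
version_hashes:
- "19bdb736"
- "96677875"  # unquoted, this one would be read as the integer 96677875
- "ce7d86ef"
```

Of course, at this point I'm typing about as much punctuation as I would in JSON, which rather defeats the purpose of quoteless strings.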

Indentation

I don't have strong opinions on the whole tabs-versus-spaces debate, so I won't go into that, but I do feel somewhat frustrated with the fact that YAML doesn't allow tabs for indentation. The spec says:

To maintain portability, tab characters must not be used in indentation, since different systems treat tabs differently.

I'm probably too young to really understand where this is coming from, but I've never had any issues using tabs in other languages and don't really understand why this would apply to YAML specifically. I wouldn't mind as much if I was only forced to use spaces in YAML files, but when building static sites, you'll often find YAML front matter, meaning the indentation you choose also applies to the markup you write. Essentially, it prevents me from ever using tabs in static sites, which is a bummer because tabs are great for allowing users to specify their preferred indentation size.

It's too much

Also, I don't feel this point as strongly as the previous two, but YAML tries to do far too much. The spec is rather long and there are many features that are so niche I've never seen them in the wild despite having seen many YAML files in many different contexts. The main offenders there are tags and anchors, but it extends to things like explicit mapping keys with ? or using objects as keys for another object.
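In case you've never run into these features either, here is a small sketch of what they look like. All keys and values are made up for illustration, and I'm not claiming every parser handles all of it; parsers that target JSON-like data structures in particular may struggle with the non-string keys.

```yaml
# Anchor (&) and alias (*): name a node once, reuse it elsewhere.
defaults: &defaults
  timeout: 30
  retries: 3
staging: *defaults

# Tag: force a type the parser would not pick on its own.
version: !!str 1.10

# Explicit mapping key with "?": here the key is a sequence, not a string.
? [fi, se]
: Nordic countries

# A mapping used as a key for another mapping.
? {x: 1, y: 2}
: some point
```

Without the !!str tag, version would be read as a number and lose its trailing zero, so tags do have their uses; I just rarely see them used.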

Also, I've recently discovered that parsing untrusted YAML can run arbitrary code, depending on the parser and how it's configured. In JavaScript terms, parsing YAML with such a parser is as evil as using eval(). This is a side effect of YAML tags. In a nutshell, YAML would be much nicer and safer if it was simpler.
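As an illustration, this is the kind of document I mean. It's a sketch that assumes a Python application using PyYAML with its unsafe loader (yaml.unsafe_load); the key name and the command are made up, and safer loaders such as yaml.safe_load should reject the tag instead of acting on it.

```yaml
# Hypothetical malicious document. With PyYAML's unsafe loader, this tag asks
# the parser to call os.system with the given argument while loading the file.
greeting: !!python/object/apply:os.system ["echo this ran during parsing"]
```

The command here is harmless, but nothing about the syntax requires it to be, which is why it matters which loader a tool uses on files it didn't write itself.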

YAML wasn't meant to be written

Fun fact: did you know YAML originally stood for "Yet Another Markup Language" but then changed its name to mean "YAML Ain't Markup Language"?

Note

I strongly dislike recursive acronyms, including "YAML", and I feel the need to vent about that, too.

They did that because YAML was intended to be a "data serialization language", meaning it's supposed to be written by machines, not humans. In fact, YAML's top priority is being readable by humans, whereas being writable by humans is nowhere to be found in YAML's goals. This explains the quoteless strings, but ultimately it's a moot point because the reality is that YAML is being written by humans, all the time. I find it surprising that YAML's creators didn't foresee that creating a human-readable data serialization language would end up with people writing it by hand; even JSON, which has an incredibly simple syntax and is intended for machine-to-machine data transfer, is being manually written. To be fair, though, YAML is very old and what might be obvious now might not have been obvious at the time YAML was invented (2001).

Alternatives

While I don't like my love-hate relationship with YAML, the truth is that there is no good alternative available. The obvious second choice is JSON, but JSON is just not nice to write by hand. I do appreciate some tools accepting JSONC (i.e. JSON with comments), but even that is not what I want to be using as my main configuration/front matter language for the rest of my life.

I would love to create my own data markup language, but I've come to realize I'm not the ideal person to create a language. My own requirements are very basic, as I don't rely on autocomplete, IntelliSense, or most other IDE features. Really the only thing I want is syntax highlighting. And, I use Sublime Text, not VS Code. So, if I write my own language, then even if I create a syntax highlighting package for Sublime Text, other people won't use it. Heck, if I write a syntax highlighting package for VS Code, still people won't use it because it lacks all the funky features people want that I don't care about (and also don't care to support in my free time). Not to mention, AI completions for new languages are probably terrible, but again, this is not something I use or care to cater for.

Nevertheless, I might try to create my own ideal data markup language, just for the heck of it… If I do, I'll write another post about it.