_____                   _                  _____            _____       _ 
  |     |___ _____ ___ _ _| |_ ___ ___ ___   |  _  |___ ___   | __  |___ _| |
  |   --| . |     | . | | |  _| -_|  _|_ -|  |     |  _| -_|  | __ -| .'| . |
  |_____|___|_|_|_|  _|___|_| |___|_| |___|  |__|__|_| |___|  |_____|__,|___|
  a newsletter by |_| j. b. crawford               home archive subscribe rss

>>> 2020-07-11 some formats

We've talked a little bit about markup languages. Broadly speaking, and to use a taxonomy which I completely made up by myself, most markup languages in use for data interchange today are either enclosure-style, in which each element is enclosed by start and stop delimiters (e.g. HTML, XML), or key-value style, in which the file consists more or less of a list of keys and values which may be enclosed in various ways to indicate structures like maps and lists (e.g. YAML and JSON). Of course there are many others as well, and I'm speaking only of data interchange here, not more general markup, but the point stands that these two families are mostly what we use today when we need to get structured data from one thing to another.
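
To make the distinction concrete, here's the same trivial record sketched in each style. This is purely illustrative and not taken from any real schema:

<!-- enclosure-style: every element bounded by a start tag and an end tag -->
<book>
  <title>The Hobbit</title>
  <year>1937</year>
</book>

# key-value style: keys and values, with indentation giving the structure
book:
  title: The Hobbit
  year: 1937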

Just trying to organize things this way brings us to a somewhat complex question: what exactly is a markup language? My carefully constructed (in about thirty seconds while slightly inebriated) taxonomy happens to exclude, for example, Markdown and RST, which would generally be called markup languages. This is partially because I'm focusing only on the things that are interesting to me in this case, but it's also partially because the concepts of a markup language and a data interchange format are somewhat loosely defined.

Wikipedia, which is never wrong, says that "a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text." This definition, on a plain reading, clearly includes HTML, Markdown, RST, and many others. Things get a little weird when we look at XML. It has Markup Language right in the name, and it can certainly be used in a fashion similar to HTML (see: the last post), but it often isn't. In cases like XML, and even more so with YAML, the argument that the markup is just an annotation on the text becomes a lot harder to defend. I would be tempted to refer to these as "data interchange formats" rather than "markup languages," but that term is already in use for something different. We could also call them "serialization formats" but people tend to associate that term more with binary formats. So the basic terminology is rather confusing here, and if I had a bit of common sense that's what I'd be trying to taxonomize.

The point of all of this is that I would like to talk a bit about formats which are used for interchanging data between different systems (or occasionally for storing and retrieving data within the same system). These are often called markup languages but are probably not really markup languages, in that they do not focus on annotating (or marking up) text; instead, they express data structures which may contain text but are not necessarily text documents. These are "markup?" languages like XML, YAML, JSON (this one doesn't call itself a markup language!), and various others. And specifically, I am talking about the ones that are text-based, as opposed to binary formats like protobuf and others.

It's very interesting to me to look at the history of how we got to our modern concept of data interchange formats. There is a surprising amount of homogeneity in most modern software. XML is very widely used but decidedly out of vogue with today's youths. JSON is perhaps the most widespread because it is (kind of) easy to use and (kind of) natively supported by JavaScript, but there are a surprising number of caveats to both of those. YAML is also quite common but surprisingly complex, and it has an uneasy relationship with JSON wherein JSON documents are also valid YAML documents but you should probably forget that. There are some upstarts like TOML and something called HOCON? But no one really cares.
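
For what it's worth, the JSON-is-also-YAML claim above is easy to poke at from Python, assuming you have PyYAML installed. This is just a sanity check, not a claim that every JSON document survives the trip (there are edge cases, which is part of why you should forget it):

import json
import yaml  # PyYAML

doc = '{"title": "The Hobbit", "tags": ["fantasy", "dragons"]}'

# The exact same text parses as JSON and as YAML, to the same structure.
print(json.loads(doc) == yaml.safe_load(doc))  # True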

As mentioned previously, XML dates back to roughly 1998. YAML came about in 2001, not that much later, but became popular probably more around the mid to late 2000s when it was viewed as the antidote to XML's significant complexity. Most people don't realize that YAML is probably just as complex, because it looks very simple in the minimal examples that most people constrain themselves to.

XML has SGML as an antecedent, and SGML is derived from IBM formats which date back to 1970 or so. Interestingly, this ancient ancestor of XML (called GML, for Generalized Markup Language, since it predates the "Standard" that the S in SGML stands for) has a certain superficial resemblance to YAML, at least in that it involves significant use of colons. That's a bit interesting as YAML does not have any clearly described ancestors.

So how does GML work? Well, it worked much like SGML in having start and end tags, but tags were started with a colon and ended with a period, rather than using the greater than/less than symbols. But GML also had a very strong sense of being line-oriented; that is, tags generally went on their own lines, which is a bit more similar to YAML than to SGML.
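
For flavor, a fragment in the style of IBM's GML "starter set" looked roughly like this; I'm reconstructing from memory, so take the specific tag names as illustrative rather than gospel:

:h1.Some Formats
:p.Tags begin with a colon and end with a period, and each tag
generally starts its own line.
:ul.
:li.one list item
:li.another list item
:eul.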

In fact, the great bulk of early data interchange formats were line-oriented. There are various reasons for this, chief among them that it is simply intuitive to put "one record per line," as it matches conventional tabular formats that we're familiar with in print (e.g. tables). It was also essentially a technical constraint of punched-card based computer systems, where "line" and "file" (in the modern sense) were more or less equivalent to "card" and "stack" when working with punched cards---that is, each card was considered a line of text. That each card could be called a "record" and a set of records made up a file shows the degree to which electromechanical punched card systems, and the computers derived from them, were intended to model pre-computer business records kept as lines in ledgers.

Overall I have found it extremely difficult to trace any kind of coherent history of these formats, which is probably reflected in how disorganized this message is. Many old data interchange formats have familial resemblances to each other, giving the tantalizing suggestion that a "family tree" could be traced of which were based on which others, but actually doing this would probably require a great deal of original research and I have both a full-time job and hours of standing in the living room staring at the wall to keep up with, so while I have made some tentative forays into the matter I do not expect to publish a treatise on the origins of XML any time soon.

Instead, I would like to mention just a few interesting old data interchange formats and some things we can learn from them. Most of these examples are old, and all of them come from a context in which a body of experts attempted to design a single, unified data model sufficient to meet all the needs of a given problem domain. This has profound implications. I have said before and I will say again that computer science is the discipline principally concerned with assigning numbers to things. In the realm of computer science (and specifically AI, in the original meaning of AI, not the marketing buzzword of today) research, the term "ontology" is borrowed from philosophy to refer to defining the nature of things. That is, ontologists in CS do not seek to establish what is; they seek to represent what is. This is perhaps the highest-level academic discipline of assigning numbers to things, and it deals with fundamental and theoretical questions about how computer systems can represent and manipulate complex domains of knowledge. While the ontologists of philosophy ponder what does and can exist, the ontologists of computer science ponder how to punch all of that onto paper cards.

XML is not exactly a masterpiece of ontology, but there is a whiff of ontology throughout the world of data interchange formats. Designing a domain-specific interchange format requires considering all of the areas of knowledge in that domain and assigning codes and keywords to them. Designing generalized interchange formats requires considering all of the structures of knowledge that need to be expressed. Because the set of data structures in use by computer systems is highly constrained by both the limits of technology and the limits of the people who use the technology (essentially everything in life is either a map or a list, regardless of what your professors told you about bicycles and inheritance), it seems that in practice creating a generalized markup language is almost the easier of the two efforts. At least JSON is really dead simple. Of course, for generalized languages which support schemas, schemas tend to bring in domain-specific knowledge and all the complexities thereof.

So let's forget about generalized markup languages for now and jump back to a time in which generalized markup languages were not in widespread use and most software systems exchanged data in domain-specific formats. These domain-specific formats were often developed by domain experts, with very careful consideration of everything which might need to be represented. We see in this pursuit both complex theoretical problems in computer science and the ways in which large parts of computer science (generally the more applied assigning of numbers) are derived from information or library science.

That was an extremely long preamble to get to the actual point of this message, but hopefully it provides a bit of context into why I am about to tell you about MARC.

If I am to argue that we can blame large parts of computer science on library science, MARC is my key piece of evidence. Librarians and other information science types are deeply concerned with the topic of "authority control," which is basically about being able to uniquely identify and look up information based on standardized names. A book ought to have one title and one author (or set of authors) which can consistently be used to look it up, even though people are prone to use abbreviations and write names in different ways. A similar problem is seen in genealogy, where the spelling of family names often drifts from generation to generation, but researchers tend to consider "McLeod" and "MacLeod" to be the same name despite the variable spelling. You could argue that when Google corrects your spelling errors it is practicing a form of authority control by standardizing your query to the authorized vocabulary.

Yes, authority control tends to be based around the idea of establishing a restricted vocabulary of standardized, or authorized, names. J. R. R. Tolkien, John Ronald Reuel Tolkien, and my insistence on misspelling it J. R. R. Tolkein ought to all be standardized to the same authorized name, so that a query for any of these representations returns all of his books. "Tolkien, J. R. R." according to the library catalog. This idea of a standardized, constrained vocabulary will be familiar to anyone in computing, since it's exactly the kind of thing we have to think about whenever we want a machine to match things up reliably. MARC rests at exactly the intersection of the two.

MARC is short for Machine-Readable Cataloging. It was developed for the Library of Congress in the 1960s for the purpose of representing the library catalog in computer form. It is still in fairly common use today as a "lowest common denominator" interchange format between library cataloging software developed by different vendors. While there is an XML variant today, MARC is most widely seen in its original, 1960s format that looks like this:

005  20180917152453.0
008  180410b ||||| |||| 00| 0 eng d
020  _c EC$20.00 (cased).
100  _a Tolkien, J.R.R.
245  _a The silmarillion /
     _c J.R.R. Tolkien ; edited by Christopher Tolkien.
260  _a London :
     _b Book Club Associates,
     _c c1977.
300  _a 365 p. ;
     _c 23 cm.
500  _a Includes index.
650  _a Baggins, Bilbo
     _v Fiction.
650  _a Middle Earth (Imaginary place)
     _v Fiction.
     _9 36397

Of course, this is not exactly what it looks like. This is in part because I have omitted certain fields to make it more readable, but it's more so because the standard representation of MARC separates fields using non-printable ASCII control characters, not newlines. I have swapped out these control characters for newlines and spaces and then indented to make things more clear. I have also omitted some junk that comes out of the details of the format, such as a bunch of extra slashes. The point is that I have made this format look tremendously more human-friendly than it actually is.

MARC consists of fields, each identified by a three-digit number. Fields may have subfields, identified by a letter. For example, field 245 is Title Statement. Subfield A is Title, subfield C is "statement of responsibility, etc." according to the LoC documentation. Not all of these fields make so much sense. Field 008 is called "fixed-length data elements" and is part of the control fields (00x fields). It contains things like the date the book was added to the catalog and where the catalog data came from, but also some less control-ey data like "target audience." All of this is combined into one field using a fixed-width format, and the pipe is for some reason used as a "fill" character for positions which are required but have no data.
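
To make the field/subfield structure a little more concrete, here is a minimal sketch in Python of pulling the subfields out of a single variable field, assuming the record has already been split apart and using the standard MARC delimiters (each subfield is introduced by an 0x1F byte followed by its one-character code). This is illustrative only, not a real ISO 2709 parser:

# Minimal sketch, not a real MARC parser: split one variable field (the 245
# Title Statement from the record above) into its subfields. In the raw
# format the field starts with two indicator characters, and each subfield
# is introduced by the 0x1F delimiter followed by a one-character code.
SUBFIELD_DELIMITER = "\x1f"

field_245 = ("10"
             "\x1faThe silmarillion /"
             "\x1fcJ.R.R. Tolkien ; edited by Christopher Tolkien.")

indicators, *subfields = field_245.split(SUBFIELD_DELIMITER)
for sf in subfields:
    code, value = sf[0], sf[1:]
    print(code, value)
# prints:
# a The silmarillion /
# c J.R.R. Tolkien ; edited by Christopher Tolkien.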

This idea of enumerating every field that might need to be expressed and then assigning numerical codes to them is a common aspect of early data interchange formats. I will show one other example before ending this rather long message and leaving more for later. That's a 1980s-vintage format that I have the pleasure of dealing with in my current day job, called Health Level 7 or HL7. HL7 serves as a "lowest common denominator" format for exchange of data between different electronic health record systems. An example HL7 record, courtesy of Wikipedia, follows, but note that I have removed some fields for brevity.

MSH|^~\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20060529090131-0500||ADT^A01^ADT_A01|01052901|P|2.5
EVN||200605290901||||200605290900
PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M||2028-9^^HL70005^RA99113^^XYZ|260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35209^^M~NICKELL'S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O|||||||0105I30001^^^99DEF^AN
OBX|1|NM|^Body Height||1.80|m^Meter^ISO+|||||F
OBX|2|NM|^Body Weight||79|kg^Kilogram^ISO+|||||F
AL1|1||^ASPIRIN
DG1|1||786.50^CHEST PAIN, UNSPECIFIED^I9|||A

If we can stop chuckling at "Nickell's Pickles," we can see that this looks very different from MARC but there is a similar phenomenon going on. Each line is a field with components separated by pipes. The first component is a three-character (but now alphanumeric) field ID. MSH identifies the message type, PID is patient identity. Each of these is separated into many subfields; in the case of PID we can make out an ID number, a name, date of birth, etc. Once again, the same basic concept of code-identified fields with various subfields, and once again represented as one field per line. This time, mercifully, the field separator is newline and the subfield separator is pipe. These are conveniently human-readable so I have not had to replace them with whitespace. Finally, we once again have the use of odd filler symbols, mainly ^.

^ needs to be used basically because of a limitation in the data model: there is no way to separate "subsubfields." Consider the address. "260 GOODWIN CREST DRIVE" has a space in it; spaces are quite acceptable. But the EHR in use, like most software, feels the need to separate components of the address into tidy fields. Space can't be used to separate subsubfields because it's used within the subfields. Newline can't be used because it's the field separator. So instead, ^ is used. Runs like ^^ just mean that a subsubfield in the middle is empty. "BIRMINGHAM^AL" is essentially equivalent to "BIRMINGHAM AL" except that the use of ^ rather than space assures the parser that it is the separator between city and state, not a space within the name of the city. Humans are largely smart enough to figure out that there is probably no city called "Birmingham Al" and so the "AL" must be a state, but computers are not.
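
Here is a rough sketch of how shallow that parsing can be, using the PID line from above (trimmed a bit further). It ignores the encoding characters declared in MSH, escape sequences, and repetition with ~, all of which a real HL7 parser has to handle:

# Rough sketch only: split one line (a field, in the terms above) on | to get
# its subfields, then split a subfield on ^ to get subsubfields. Escaping,
# the MSH-declared delimiters, and ~ repetition are all ignored here.
line = ("PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M||"
        "2028-9^^HL70005^RA99113^^XYZ|"
        "260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35209^^M")

subfields = line.split("|")
print(subfields[0])             # PID, the field ID
print(subfields[5].split("^"))  # ['KLEINSAMPLE', 'BARRY', 'Q', 'JR']

street, _, city, state = subfields[11].split("^")[:4]
print(street, "/", city, "/", state)  # 260 GOODWIN CREST DRIVE / BIRMINGHAM / AL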

Alright, I'm going to try to stop talking now. But I want to follow up in a future post by going on at length about fixed-width fields and their long heritage, and also perhaps about the pipe as a field separator, which is something that's very widely seen in early (say pre-1995) formats but rarely seen today. That will bring me to the matter of the comma as a field separator, something that is in fact very common today and has turned out to be a monumental pain. Finally, I'll loop back to those ASCII control characters that MARC used and I removed for you, and wonder why no one uses them today.