  _____                   _                  _____            _____       _
  |     |___ _____ ___ _ _| |_ ___ ___ ___   |  _  |___ ___   | __  |___ _| |
  |   --| . |     | . | | |  _| -_|  _|_ -|  |     |  _| -_|  | __ -| .'| . |
  |_____|___|_|_|_|  _|___|_| |___|_| |___|  |__|__|_| |___|  |_____|__,|___|
  a newsletter by |_| j. b. crawford

>>> 2020-07-11 some formats

We've talked a little bit about markup languages. Broadly speaking, and to use a taxonomy which I completely made up by myself, most markup languages in use for data interchange today are either enclosure-style, in which each element is enclosed by start and stop delimiters (e.g. HTML, XML), or key-value style, in which the file consists more or less of a list of keys and values which may be enclosed in various ways to indicate structures like maps and lists (e.g. YAML and JSON). Of course there are many others as well and I'm speaking only of data interchange here, not more general markup, but the point stands that these two families are mostly what we use today when we need to get structured data from one thing to another.
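To make the two families concrete, here is a minimal Python sketch that reads the same small record from an enclosure-style document and a key-value-style document using only the standard library. The element and key names (`book`, `title`, `author`) are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Enclosure-style: each value sits between a start and a stop delimiter.
xml_doc = "<book><title>The Silmarillion</title><author>Tolkien, J. R. R.</author></book>"

# Key-value style: a list of keys and values, wrapped in braces to mark a map.
json_doc = '{"title": "The Silmarillion", "author": "Tolkien, J. R. R."}'

# Iterating over an Element yields its child elements.
from_xml = {child.tag: child.text for child in ET.fromstring(xml_doc)}
from_json = json.loads(json_doc)

print(from_xml == from_json)
```

Both end up as the same map in memory, which is rather the point: the two styles differ in syntax far more than in what they can express.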

Just trying to organize things this way brings us to a somewhat complex question: what exactly is a markup language? My carefully constructed (in about thirty seconds while slightly inebriated) taxonomy happens to exclude, for example, markdown and RST, which would generally be called markup languages. This is partially because I'm focusing only on the things that are interesting to me in this case, but it's also partially because the concepts of a markup language and a data interchange format are somewhat loosely defined.

Wikipedia, which is never wrong, says that "a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text." This definition, on a plain reading, clearly includes HTML, Markdown, RST, and many others. Things get a little weird when we look at XML. It has Markup Language right in the name, and it can certainly be used in a fashion similar to HTML (see: the last post), but it often isn't. In cases like XML, and even more so with YAML, the argument that the markup is just an annotation on the text becomes a lot harder to defend. I would be tempted to refer to these as "data interchange formats" rather than "markup languages," but that term is already in use for something different. We could also call them "serialization formats" but people tend to associate that term more with binary formats. So the basic terminology is rather confusing here, and if I had a bit of common sense that's what I'd be trying to taxonomize.

The point of all of this is that I would like to talk a bit about formats which are used for interchanging data between different systems (or occasionally for storing and retrieving data within the same system). These are often called markup languages but are probably not really markup languages in that they do not focus on annotating (or marking up) text; instead, they express data structures which may contain text but are not necessarily text documents. These are "markup?" languages like XML, YAML, JSON (this one doesn't call itself a markup language!), and various others. And specifically, I am talking about the ones that are text-based, as opposed to binary formats like protobuf and others.

It's very interesting to me to look at the history of how we got to our modern concept of data interchange formats. There is a surprising amount of homogeneity in most modern software. XML is very widely used but decidedly out of vogue with today's youths. JSON is perhaps the most widespread because it is (kind of) easy to use and (kind of) natively supported by JavaScript, but there are a surprising number of caveats to both of those. YAML is also quite common but surprisingly complex, and it has an uneasy relationship with JSON wherein JSON documents are also valid YAML documents but you should probably forget that. There are some upstarts like TOML and something called HOCON? But no one really cares.

As mentioned previously, XML dates back to roughly 1998. YAML came about in 2001, not that much later, but became popular probably more around the mid to late 2000s when it was viewed as the antidote to XML's significant complexity. Most people don't realize that YAML is probably just as complex, because it looks very simple in the minimal examples that most people constrain themselves to.

XML has SGML as an antecedent, and SGML is derived from IBM formats which date back to 1970 or so. Interestingly, this ancient ancestor of XML (called GML, because it predates the S in SGML) has a certain superficial resemblance to YAML, at least in that it involves significant use of colons. That's a bit interesting as YAML does not have any clearly described ancestors.

So how does GML work? Well, it worked much like SGML in having start and end tags, but tags were started with a colon and ended with a period, rather than using the greater than/less than symbols. But GML also had a very strong sense of being line-oriented, that is that tags generally went on their own line, which is a bit more similar to YAML than to SGML.

In fact, the great bulk of early data interchange formats were line-oriented. There are various reasons for this, chief among them that it is simply intuitive to put "one record per line," as it matches conventional tabular formats that we're familiar with in print (e.g. tables). It was also essentially a technical constraint of punched-card based computer systems, where "line" and "file" (in the modern sense) were more or less equivalent to "card" and "stack" when working with punched cards---that is, each card was considered a line of text. That each card could be called a "record" and a set of records made up a file shows the degree to which electromechanical punched card systems, and the computers derived from them, were intended to model pre-computer business records kept as lines in ledgers.

Overall I have found it extremely difficult to trace any kind of coherent history of these formats, which is probably reflected in how disorganized this message is. Many old data interchange formats have familial resemblances to each other, giving the tantalizing suggestion that a "family tree" could be traced of which were based on which others, but actually doing this would probably require a great deal of original research and I have both a full-time job and hours of standing in the living room staring at the wall to keep up with, so while I have made some tentative forays into the matter I do not expect to publish a treatise on the origins of XML any time soon.

Instead, I would like to mention just a few interesting old data interchange formats and some things we can learn from them. Most of these examples are old, and all of them come from a context in which a body of experts attempted to design a single, unified data model sufficient to meet all the needs of a given problem domain. This has profound implications. I have said before and I will say again that computer science is the discipline principally concerned with assigning numbers to things. In the realm of computer science (and specifically AI, in the original meaning of AI, not the marketing buzzword of today) research, the term "ontology" is borrowed from philosophy to refer to defining the nature of things. That is, ontologists in CS do not seek to establish what is, they seek to represent what is. This is perhaps the highest-level academic discipline of assigning numbers to things and deals with fundamental and theoretical questions about how computer systems can represent and manipulate complex domains of knowledge. While the ontologists of philosophy ponder what does and can exist, the ontologists of computer science ponder how to punch all of that onto paper cards.

XML is not exactly a masterpiece of ontology, but there is a whiff of ontology throughout the world of data interchange formats. Designing a domain-specific interchange format requires considering all of the areas of knowledge in that domain and assigning codes and keywords to them. Designing generalized interchange formats requires considering all of the structures of knowledge that need to be expressed. Because the set of data structures in use by computer systems is in practice highly constrained by both the limits of technology and the limits of the people who use the technology (essentially everything in life is either a map or a list, regardless of what your professors told you about bicycles and inheritance), it seems that in practice creating a generalized markup language is almost the easier of the two efforts. At least JSON is really dead simple. Of course, for generalized languages which support schemas, schemas tend to bring in domain-specific knowledge and all the complexities thereof.

So let's forget about generalized markup languages for now and jump back to a time in which generalized markup languages were not in widespread use and most software systems exchanged data in domain-specific formats. These domain-specific formats were often being developed by domain experts using very careful consideration of everything which may need to be represented. We see in this pursuit both complex theoretical problems in computer science and the ways in which large parts of computer science (generally the more applied assigning of numbers) are derived from information or library science.

That was an extremely long preamble to get to the actual point of this message, but hopefully it provides a bit of context into why I am about to tell you about MARC.

If I am to argue that we can blame large parts of computer science on library science, MARC is my key piece of evidence. Librarians and other information science types are deeply concerned with the topic of "authority control," which is basically about being able to uniquely identify and look up information based on standardized names. A book ought to have one title and one author (or set of authors) which can consistently be used to look it up, even though people are prone to use abbreviations and write names in different ways. A similar problem is seen in genealogy where the spelling of family names often drifts from generation to generation, but researchers tend to consider "McLeod" and "MacLeod" to be the same name despite the variable spelling. You could argue that when Google corrects your spelling errors it is practicing a form of authority control by standardizing your query to the authorized vocabulary.

Yes, authority control tends to be based around the idea of establishing a restricted vocabulary of standardized, or authorized, names. J. R. R. Tolkien, John Ronald Reuel Tolkien, and my insistence on misspelling it J. R. R. Tolkein ought to all be standardized to the same authorized name, so that a query for any of these representations returns all of his books. "Tolkien, J. R. R." according to the library catalog. This idea of a standardized, constrained vocabulary will be familiar to anyone in computing as it's the same kind of thing we have to think about when dealing with computers. MARC rests at exactly the intersection of the two.

MARC is short for Machine-Readable Cataloging. It was developed for the Library of Congress in the 1960s for the purpose of representing the library catalog in computer form. It is still in fairly common use today as a "lowest common denominator" interchange format between library cataloging software developed by different vendors. While there is an XML variant today, MARC is most widely seen in its original, 1960s format that looks like this:

    005 20180917152453.0
    008 180410b ||||| |||| 00| 0 eng d
    020 _c EC$20.00 (cased).
    100 _a Tolkien, J.R.R.
    245 _a The silmarillion /
        _c J.R.R. Tolkien ; edited by Christopher Tolkien.
    260 _a London :
        _b Book Club Associates,
        _c c1977.
    300 _a 365 p. ;
        _c 23 cm.
    500 _a Includes index.
    650 _a Baggins, Bilbo
        _v Fiction.
    650 _a Middle Earth (Imaginary place)
        _v Fiction.
        _9 36397

Of course, this is not exactly what it looks like. This is in part because I have omitted certain fields to make it more readable, but it's more so because the standard representation of MARC makes use of non-printable ASCII control characters to separate fields, and not the newline. I have swapped out these control characters for newlines and spaces and then indented to make things more clear. I have also omitted some junk that comes out of the details of the format such as a bunch of extra slashes. The point is that I have made this format look tremendously more human-friendly than it actually is.
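For a sense of what those control characters look like in practice, here is a heavily simplified Python sketch. It assumes the usual MARC convention that fields end with ASCII 0x1E and each subfield is introduced by ASCII 0x1F followed by a one-character code. It ignores the leader and directory that a real record carries (which is where the three-digit tags actually live), and the sample bytes and indicator values are invented for illustration.

```python
FIELD_TERMINATOR = "\x1e"    # ASCII RS: ends a field
SUBFIELD_DELIMITER = "\x1f"  # ASCII US: introduces a subfield code

# Hypothetical field bodies: two indicator characters, then the subfields.
raw = (
    "1 " + SUBFIELD_DELIMITER + "aTolkien, J.R.R." + FIELD_TERMINATOR
    + "10" + SUBFIELD_DELIMITER + "aThe silmarillion /"
    + SUBFIELD_DELIMITER + "cJ.R.R. Tolkien" + FIELD_TERMINATOR
)

def split_fields(data):
    """Return a list of (indicators, [(subfield_code, value), ...])."""
    fields = []
    for body in data.split(FIELD_TERMINATOR):
        if not body:
            continue  # the final terminator leaves an empty trailing piece
        indicators, *subfields = body.split(SUBFIELD_DELIMITER)
        fields.append((indicators, [(s[0], s[1:]) for s in subfields if s]))
    return fields

for indicators, subfields in split_fields(raw):
    print(repr(indicators), subfields)
```

Note that nothing here needs escaping or quoting: because the separators are characters that never appear in ordinary text, a plain split is enough.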

MARC consists of fields, each identified by a three-digit number. Fields may have subfields, identified by a letter. For example, field 245 is Title Statement. Subfield A is Title, subfield C is "statement of responsibility, etc." according to the LoC documentation. Not all of these fields make so much sense. Field 008 is called "fixed-length data elements" and is part of the control fields (00x fields). It contains things like the date the book was added to the catalog and where the catalog data came from, but also some less control-ey data like "target audience." But all of this is combined into one field using a fixed-width format, and the pipe is for some reason used as a "fill" character for fields which are required but have no data.
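The field-and-subfield structure can be sketched in a few lines of Python. This parses only the cleaned-up display form shown earlier (a tag, then `_x value` subfields, with continuation lines for further subfields), not real MARC. The function name and the choice to keep 00x control fields as plain strings are my own assumptions.

```python
import re

SAMPLE = """\
005 20180917152453.0
100 _a Tolkien, J.R.R.
245 _a The silmarillion /
    _c J.R.R. Tolkien ; edited by Christopher Tolkien.
"""

def parse_display_marc(text):
    """Return [(tag, data)], where data is a string for 00x control
    fields and a list of (subfield_code, value) pairs otherwise."""
    raw_fields = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("_") and raw_fields:
            raw_fields[-1] += " " + line  # continuation: another subfield
        else:
            raw_fields.append(line)

    fields = []
    for raw in raw_fields:
        tag, _, rest = raw.partition(" ")
        if tag.startswith("00"):
            fields.append((tag, rest))  # control field: no subfields
        else:
            # Split on "_x " markers, keeping the subfield letter.
            parts = re.split(r"(?:^|\s)_(\w)\s", rest)
            fields.append((tag, list(zip(parts[1::2], parts[2::2]))))
    return fields

for tag, data in parse_display_marc(SAMPLE):
    print(tag, data)
```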

This idea of enumerating every field that might need to be expressed and then assigning numerical codes to them is a common aspect of early data interchange formats. I will show one other example before ending this rather long message and leaving more for later. That's a 1980s-vintage format that I have the pleasure of dealing with in my current day job, called Health Level 7 or HL7. HL7 serves as a "lowest common denominator" format for exchange of data between different electronic health record systems. An example HL7 record, courtesy of Wikipedia, follows, but note that I have removed some fields for brevity.

    MSH|^~\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20060529090131-0500||ADT^A01^ADT_A01|01052901|P|2.5
    EVN||200605290901||||200605290900
    PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M||2028-9^^HL70005^RA99113^^XYZ|260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35209^^M~NICKELL’S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O|||||||0105I30001^^^99DEF^AN
    OBX|1|NM|^Body Height||1.80|m^Meter^ISO+|||||F
    OBX|2|NM|^Body Weight||79|kg^Kilogram^ISO+|||||F
    AL1|1||^ASPIRIN
    DG1|1||786.50^CHEST PAIN, UNSPECIFIED^I9|||A

If we can stop chuckling at "Nickell's Pickles," we can see that this looks very different from MARC but there is a similar phenomenon going on. Each line is a field with components separated by pipes. The first component is a three-character (but now alphanumeric) field ID. MSH identifies message type, PID is patient identity. Each of these is separated into many subfields; in the case of PID we can make out an ID number, a name, date of birth, etc. Once again, the same basic concept of code-identified fields with various subfields, and once again represented as one field per line. This time, mercifully, the field separator is newline and the subfield separator is pipe. These are conveniently human-readable so I have not had to replace them with whitespace. Finally, we once again have the use of odd filler symbols, mainly ^.

^ needs to be used basically because of a limitation in the data model: there is otherwise no way to separate "subsubfields." Consider the address. "260 GOODWIN CREST DRIVE" has a space in it, spaces are quite acceptable. But the EHR in use, like most software, feels the need to separate components of the address into tidy fields. Space can't be used to separate subsubfields because it's used within the subfields. Newline can't be used because it's the field separator. So instead, ^ is used. Where ^^ appears, it is simply two separators with an empty subsubfield between them, a slot that this record leaves blank. "BIRMINGHAM^AL" is essentially equivalent to "BIRMINGHAM AL" except that the use of ^ rather than space assures the parser that it is the separator between city and state, not a space within the name of the city. Humans are largely smart enough to figure out that there is probably no city called "Birmingham Al" and so the "AL" must be a state, but computers are not.
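The layering described here (newline, then pipe, then ^) maps directly onto nested splits. The following Python is a minimal sketch, not a real HL7 parser: it ignores the special handling the MSH line needs, the `~` repetition character, and escape sequences. The function name is hypothetical.

```python
def parse_segment(line):
    """Split one line on '|' into subfields, then each subfield on '^'."""
    segment_id, *subfields = line.split("|")
    return segment_id, [subfield.split("^") for subfield in subfields]

segment_id, subfields = parse_segment("OBX|1|NM|^Body Height||1.80|m^Meter^ISO+|||||F")
print(segment_id)    # the three-character field ID
print(subfields[2])  # the observation name, with an empty first slot
print(subfields[5])  # the units, as three subsubfields

# An address on its own: empty strings mark slots the record leaves blank.
address = "260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35209^^M".split("^")
print(address)
```

The space inside "260 GOODWIN CREST DRIVE" survives untouched, which is exactly what using ^ instead of a space buys.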

Alright, I'm going to try to stop talking now. But I want to follow up in a future post by going on at length about fixed-width fields and their long heritage, and also perhaps about the pipe as a field separator, which is something that's very widely seen in early (say pre-1995) formats but rarely seen today. That will bring me to the matter of the comma as a field separator, something that is in fact very common today and has turned out to be a monumental pain. Finally, I'll loop back to those ASCII control characters that MARC used and I removed for you, and wonder why no one uses them today.


>>> 2020-06-27 simple generalized message

As the local news warned us in the early 2000s, the internet is a scary place full of hidden dangers. One of these is HTML.

Let's begin this discussion of the internet's favorite markup languages with just a quick bit about XML. XML, or Xtensible Markup Language, is a complicated markup language which is highly popular with enterprise software and Microsoft. More seriously, XML was introduced in the late '90s as a highly standardized markup language which could be used for a wide variety of different purposes while still being amenable to consistent parsing and validation. This was achieved by making XML "extensible" in the sense that multiple schemas and document type definitions (DTDs) can be used and combined to allow XML to express nearly anything---while still being conformant to a well-defined, standard schema. But we're not here to talk about XML.

Conforming to a well-defined, standard schema is a lot of work and not very fun, so naturally XML has fallen out of fashion. First, the community favored YAML over XML. YAML is a markup language which appears, on first glance, to be very simple, but as soon as one looks beneath the surface they discover a horrifying Minotaur's labyrinth of complex behavior and security vulnerabilities in the making. Partially in response to this problem but mostly in response to the community losing interest in every development target that isn't Google Chrome, YAML itself has largely fallen out of favor and been replaced by JSON, except for all of the places where it hasn't. Also there is TOML. But we're not here to talk about markup languages either.

We're going to talk about HTML.

Computers Are Bad pop quiz: do you, in your heart of hearts, believe that HTML is a form of XML?

If you answered yes, you are wrong. But, you are wrong in a very common way, which seems to be rather influential. That is what we're here to talk about.

It's actually kind of clear on the face of it that HTML is not derived from XML. The first XML specification was published in 1998; depending on how you look at it HTML was first in use somewhere between 1990 and 1995. In fact, both HTML and XML are derived from a now largely-forgotten standard called SGML, or Standard Generalized Markup Language, which traces its history back several decades before HTML or XML. HTML and XML have a familial resemblance because they are siblings, not parent and child.

This has some interesting implications. To really get at them, we need to look a little bit at SGML. The following is a valid SGML snippet:


The following is a valid HTML4 snippet:

    <li>Item one
    <li>Item two

These look very similar but---HOLD ON A MOMENT---the SGML version has some weird business going on, and in the HTML version the li (list item) elements are just dangling with nothing on the other side! I am exaggerating as to the level of shock here, but my impression is that a lot of people with a moderate to even professional level understanding of HTML would be surprised that this is valid.

When XML was designed based on SGML, one of the explicit goals was actually to make the language simpler and easier to parse. This might be a surprise to anyone who has ever interacted with an XML parser. But the reality is that XML is easier to parse because it has a much stricter definition that makes XML documents more consistent from one to the next. One of these strict rules is that, in XML, all elements must be explicitly closed. This is a new rule introduced by XML: in SGML, there is not only a compressed syntax to close an element (</>) but closing elements is often optional. No closing tag is required at all if the parser can infer that the element must have closed from the context (specifically when a new element starts which cannot be nested in the prior).
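The inference rule can be sketched with Python's standard `html.parser`, used here purely as a tokenizer. The `CANNOT_CONTAIN` table is a made-up stand-in for what a real SGML parser would derive from the DTD, and the class name is hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical rule table: an open element (key) must end as soon as
# one of the listed elements starts, because it cannot contain it.
CANNOT_CONTAIN = {"li": {"li"}}

class ImpliedCloser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements
        self.events = []  # what the parser concluded, in order

    def handle_starttag(self, tag, attrs):
        # Close the innermost element if it cannot contain the new one.
        while self.stack and tag in CANNOT_CONTAIN.get(self.stack[-1], set()):
            self.events.append(("implied-end", self.stack.pop()))
        self.stack.append(tag)
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignore it in this sketch
        # Closing an outer element closes anything still open inside it.
        while self.stack[-1] != tag:
            self.events.append(("implied-end", self.stack.pop()))
        self.stack.pop()
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():
            self.events.append(("text", data.strip()))

parser = ImpliedCloser()
parser.feed("<ul><li>Item one<li>Item two</ul>")
parser.close()
for event in parser.events:
    print(event)
```

Neither `li` is ever explicitly closed, yet the event list ends up with an end for each of them. An XML parser never has to do this bookkeeping, which is a good part of why it can be simpler.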

We do something pretty similar in English. When we tell stories, we typically omit stating that we stopped doing something because most of the time that can be inferred from the fact that we started doing something else. This works well because humans have an especially sophisticated ability to interpret natural language using our understanding of the world. Computers do not understand that the world exists, so enforcing very strict rules on the construction of languages makes it easier for computers to understand them. This would all be basic knowledge to anyone with a CS degree and/or who has heard of Noam Chomsky as a linguist rather than as a socialist, but it's still pretty interesting to think about. As a general rule, the better a language is for a computer, the worse it is for humans!

So XML made the decision to be annoying to humans (by requiring that you explicitly state many things that could be inferred) in order to make parsers simpler. HTML, being derived from SGML instead, requires that parsers be more sophisticated by allowing authors to elide many details.

Perhaps you can imagine where this goes wrong. In fact, for various reasons that range from loose specifications to simple lack of care, HTML parsers were both extremely complex and extremely inconsistent. This reached a peak in the late 2000s as many webpages either only worked properly in certain web browsers or had to include significant markup dedicated to making single web browsers function properly. While there was some degree of blame all around, Microsoft's Internet Explorer was the main villain both because its developers had a habit of introducing bizarre non-standard features and because Microsoft is fundamentally hateable. Because of MSIE's large market share, the de facto situation was that many webpages functioned properly in MSIE but not in, say, Netscape Navigator, err, uhh, Firefox, even though Firefox was the browser that did a better job of adhering to the written standards.

This situation led to a fairly serious backlash in the web community. While some things of real import happened like an EU antitrust case, more significantly, it became fashionable to declare in the footer of websites that they were Standards Compliant. Yes, admit it, we are all guilty here.

But something else rather interesting happened, and that's XHTML. In the late '90s, work started on a new variant of HTML which would actually be based on XML, and not on SGML. This had the advantage that XML parsers were simpler, and so web browser HTML parsers could be simpler, more consistent, and have better and more consistent handling of errors. At the time, essentially no one cared, but as the browser wars escalated a more consistent specification for HTML, which was more amenable to exact parsing and machine validation, started to look extremely tempting.

Further adding to XHTML's popularity, the same time period was a high point in interest in the "semantic web." Because XHTML is Xtensible, arbitrary XML schemas could be embedded in XHTML documents to semantically express structured data for machine consumption, along with presentation logic for display to humans. This is the kind of thing that sounds extremely cool and futuristic and no one actually cares about. The Semantic Web was much discussed but little implemented until Google and Facebook started imposing markup standards which were significantly less elegant but required for good search rankings and/or native social media traffic, and so many SEO consultants transitioned from adding paragraphs of invisible text in the footer to adding weird meta tags to the header in order to look better in the Facebook feed. Now that is computing technology.

Most people who learned HTML in the 2005-2015 time period actually learned XHTML, and may not realize it. That's why, today, they strictly close all of their elements, including the empty ones.

This whole thing is made sort of funny by the fact that XHTML was rather short-lived. The release of the HTML5 specification in 2014 largely addressed all of the shortcomings of the HTML 4.01 specification, and obsoleted XHTML. Part of this is because HTML5 was the shiny new thing, part of it is because HTML5 largely integrated the features of XHTML in a more convenient fashion than XHTML, and part of it is because XML was very popular with Microsoft, which is extremely hateable.

In the end, XHTML is essentially forgotten today, very quickly in internet terms although surely there are still plenty of websites out there written in it and not yet updated. Perhaps the bigger influence of XHTML is that all we Millennials are running around closing all of our elements explicitly, which is considerably ironic in a world where we are omitting whitespace from our JavaScript to save bytes. In fact, in a quick survey of HTML5 minifiers, most don't seem to remove unnecessary closing tags.

Of course, HTML parsers being what they are, it's guaranteed that there are parsers in use which will malfunction when presented with these completely standards-compliant documents! I love parsing.


>>> 2020-06-20 204 No Content

One of those things that nearly everyone knows about computers is that for some reason "404" means "file not found." Most people that work with computers seriously are aware that HTTP uses a set of three-digit numbers to report status back to the client, and that these codes are categorized by first digit. For example, the '2xx' codes generally mean 'success' and '200' means 'OK.' The '4xx' codes mean that there is something wrong with the request, and '404' means that the requested file could not be found by the server.
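The first-digit categorization is simple enough to sketch in Python. The class names below follow the usual HTTP terminology, and `http.HTTPStatus` from the standard library supplies the conventional text for a given code. The `status_class` helper is my own name, not part of any library.

```python
from http import HTTPStatus

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code):
    """Categorize a three-digit status code by its first digit."""
    return CLASSES.get(code // 100, "unknown")

for code in (200, 204, 404):
    print(code, status_class(code), HTTPStatus(code).phrase)
```

Note that the standard library's phrase for 404 is just "Not Found"; the "file" is folk tradition.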

Perhaps less widely known is where this whole idea of status codes comes from.

It's not unique to HTTP at all. Another widely used internet protocol, SMTP, uses a very similar scheme of three-digit codes in which, for example, '250' means something similar to 'OK' (the requested action completed normally) and '4xx' codes indicate a transient failure, for example '452' means that the server has run out of storage for the message. This is obviously very similar to HTTP, down to the rough meaning of the first-digit categories.

SMTP was first formally described (by Jon Postel!) in RFC 821, dated 1982. HTTP was first formally described (by Tim Berners-Lee!) in RFC 1945, dated 1996. Both protocols saw limited internal use prior to being published in RFC format, but it's clear from the gap in years that SMTP is the older protocol. In fact, it's kind of fascinating to me to consider that HTTP was published when I was alive, as it seems so ubiquitous that it must be older than me.

Anyway, FTP was formally described (also by Jon Postel!) in RFC 765 dated 1980, and in fact FTP uses a set of three-digit numeric status codes that also match the categories used by HTTP. RFC 765 elaborates somewhat on the concept of the reply codes:

The number is intended for use by automata to determine what state to
enter next; the text is intended for the human user.

We must remember that it was 1980, a rather different day in computing, when we read that a separate numeric representation must be provided "for use by automata." Indeed, a set of state diagrams is provided in the RFC based on those codes. It's an extremely "early computer science" way to approach the problem of designing a protocol. That is to say, it makes perfect logical sense and is perhaps the best approach, but has been largely abandoned today because such a state diagram for a "modern" protocol would span kilometers.

The question that interests me is whether or not FTP is the origin of the concept of three-digit status codes or reply codes, and the rough categorization of 100 for continuation, 200 for OK, 300 for redirect, 400 for temporary error, and 500 for permanent error (HTTP uses those last two a little bit differently, for client-side and server-side error).

RFC 765 was not the first discussion of FTP, which, being a very obvious idea (what if we could use this newfangled network to move files around!), has a long history. Numerous earlier RFCs represent different stages in the development of FTP. The three-digit error codes seem to first appear in RFC 354, a revision of the draft standard. Previous revisions of the draft (and protocol, prior to being TCP-based) use one-byte binary error codes or do not specify brief numeric error codes.

RFC 354 conveniently states that the FTP error codes are similar to the RJE protocol. RJE, or Remote Job Entry, is a now forgotten protocol which was essentially a very early form of RPC (as now done with protocols like XML-RPC and arguably basically all network APIs). Indeed, RJE, as described in draft form in RFC 360, includes a very similar set of status codes (including 200 OK), except that it also uses the 0xx series of codes.

Confusingly, RJE incorporates FTP as a component of the protocol, but an earlier form of FTP based on NCP (not TCP) that uses one-byte status codes.

As suggested by the sequence numbers, RFC 360 is very close in date to the previously mentioned RFC 354, and explicitly mentions that the same set of status codes are intended to be applicable to "other protocols besides RJE (like FTP.)" The wording in these two RFCs would seem to imply that the idea originated with RJE and was then also applied to FTP; the two both had authors at MIT who were presumably sharing notes, and there is logical overlap between the two protocols including RJE essentially having an FTP "mode," which makes them difficult to completely separate.

This RJE protocol, as ultimately formally described in RFC 407 after revisions, was actually somewhat sparsely used. RJE protocols in general were mostly used with mainframe and time-sharing systems, which mostly predated ARPANET, and so already had their own various RJE protocols implemented by the vendor or the user (these were back in the days when owners of time sharing systems sometimes wrote their own operating systems to get a few features they wanted). This makes it pretty difficult to trace the history of RFC 407 in much detail, not least because the term "RJE" refers collectively to at least a dozen different such published protocols.

I was able to track down contact information for one of the authors of RFC 407, Richard Guida. Unfortunately he didn't recall how the reply code numbers came about, but I'm not especially surprised. Of course this was quite a long time ago, but the reply codes also seem like a relatively obvious idea that probably didn't strike anyone as particularly noteworthy at the time.

Notably, there is some precedent. The pre-TCP (NCP) version of FTP, which predates RFC 407 RJE, uses a one-byte reply code in a fairly similar way to RJE and TCP FTP. Speculatively, it seems likely that one of the authors of RJE (or possibly TCP FTP which seems to have been written out more or less in parallel) was familiar with the previous NCP FTP protocol and decided that replacing the one-byte reply code with a three-digit ASCII reply code would both be more human-readable (useful in a time when debugging protocol implementations by interacting with them "manually" was probably more common) and would allow for hierarchical organization by digit.

In fact, the hierarchy was somewhat more specific then. Both the RJE and TCP FTP specifications refer to the reply codes as being organized into three levels by hundreds, tens, and ones. HTTP makes no mention of such a three-level hierarchy, only the two levels of hundreds and ones. While Tim Berners-Lee was clearly inspired by the RJE/FTP reply codes, he did not duplicate their structure as faithfully as SMTP did.

In summary, the three-digit HTTP status codes date back to at least 1972, and were already about a quarter century old when they (or at least a similar set) were used for HTTP. We are now coming up on 50 years since 200 "OK" was first defined, and it does not seem likely that it will go away any time soon.

One might question the utility of having these numeric reply codes when there are also text explanations sent along with them. The original intent seems to have primarily been that the numeric codes were easier to parse and use in software. That said, all the way back, protocols which use these codes have stated that the text representation is not bound to a specific string. This means that a 404 error is a 404 error regardless of whether or not the accompanying text error is 'File Not Found,' which could allow for internationalization or just unusual server configurations.
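To make the "easier to parse" point concrete, here is a minimal sketch in Python. It assumes a simplified reply line of the form "<code> <text>" (real protocols wrap this in more syntax), and the function names are my own invention rather than anything from the RFCs. The software looks only at the digits; the text is carried along for humans.

```python
def parse_reply(line):
    """Split a reply line such as '404 File Not Found' into (code, text).

    Assumes a simplified '<three digits> <free text>' layout.
    """
    code_text, _, reason = line.strip().partition(" ")
    if len(code_text) != 3 or not (code_text.isascii() and code_text.isdigit()):
        raise ValueError(f"not a three-digit reply code: {line!r}")
    return int(code_text), reason


def reply_digits(code):
    """Return the (hundreds, tens, ones) digits of a reply code."""
    return code // 100, (code // 10) % 10, code % 10


def is_error(code):
    """Treat the 4xx and 5xx classes as failures, whatever the text says."""
    return code // 100 >= 4


# The decision depends only on the number, so the text is free to vary.
english = parse_reply("404 File Not Found")
french = parse_reply("404 Fichier introuvable")
assert english[0] == french[0] == 404
assert is_error(english[0])
```

A client written this way behaves identically whether the server says "File Not Found," something in another language, or nothing at all after the number.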

Of course, in the world of HTTP, these errors are almost always represented to the end user in the form of a dedicated page designed to express the error. As a result, the actual HTTP status code and conventional error string "File Not Found" are basically irrelevant. That said, both browsers and servers have long had default representations of these errors which included the literal phrase "404 File Not Found," and this has pushed the status code and error string into the cultural lexicon firmly enough that they remain in common use on custom-designed error pages that could say whatever they want.

In the end, a fairly minor detail of a network protocol could end up influencing the popular culture fifty years later. Kind of makes you nervous about your API designs today, doesn't it?


>>> 2020-06-18 ASCII

There is an interesting little chapter of computer history involving ASCII and Japan.

ASCII is, of course, the American Standard Code for Information Interchange. I often say that computer science is an academic discipline principally concerned with assigning numbers to things. Of the many things which need numbers assigned to them, the letters of the alphabet are perhaps one of the most common. ASCII is a formal standard, derived from several informal ones, for allocating numbers to all of the characters which were deemed by the computer industry to be important in 1963. It likely requires no explanation that ASCII accounts only for the English language and American currency.

ASCII itself is not especially interesting, except to note that it is in fact a seven bit code, which leads to the important "computers are bad" theme of what it means for a system to be "eight bit clean" and why some systems are not. That is a topic for a later day, though. Today I will constrain myself to ASCII and Japan.

Japan, of course, principally uses a language which cannot be represented by the 128 code points of ASCII, most of which are English characters and punctuation and the rest of which are control characters no one can be bothered to remember[1]. At the same time, Japan was the first adopter of computer technology in East Asia and, by many metrics, one of the first adopters of computer technology outside of the United States. Considering that nearly all early computers either used ASCII or an even smaller character set, this raises an inherent problem, which was largely resolved by the introduction of various Japan-specific character sets (often called "code pages" by earlier computer systems), which eventually mostly consolidated into SHIFT-JIS.

And yet, in Japan, ASCII was for a time a very big deal. I am talking, of course, not about US cultural dominance, or about Japanese industry being forced to at least partially use Roman characters due to the limitations of technology designed in America, but rather about the ASCII Corporation.

The ASCII Corporation published ASCII Magazine, which was the preeminent computer technology magazine of Japan. Being published in Japanese, ASCII Magazine was, of course, not representable in ASCII. Most interestingly, ASCII Corporation was, for over a decade, the Asian sales division for Microsoft. Microsoft and ASCII collaborated to design an open standard for personal computers called MSX, which was on the market at the same time as the IBM PC and ultimately failed to gain more traction than PC clones. That said, Microsoft's experience with MSX, along with the PC, was no doubt one of the motivators in Microsoft's broader philosophy of decoupling the hardware vendor and software vendor[2].

This is all somewhat aside from the curiosity of the name ASCII. I have found limited historical information on ASCII Magazine. In part this is because the original material is in Japanese, but I have noticed a more general trend of computer historians being oddly uninterested in the popular publications. The kind of excessively concise summary usually given of ASCII Magazine's history is typical of the US computer hobby magazines as well.

What is fairly well documented is that the key founder of ASCII Magazine and the ASCII Corporation had recently visited industry events in the US, and of course Japanese computer hobbyists would have been well exposed to ASCII due to the common use of imported American and British computers. It seems likely that the founders simply chose a "computer-ey" term that sounded cool, nearly all such terms being of course divorced from their original meanings when borrowed into Japanese.

The introduction of computer technology into foreign markets is the kind of topic that you could write many books about. The case of Japan is interesting for being perhaps the first major market for American and British computer companies which used characters other than the Roman alphabet, essentially introducing the problem of internationalization which we know and love today. Some time later Arabic led to a second round of the effect as software had to be made to account for right-to-left layout. Both of these are still very much real problems today, with character encoding confusion and RTL layout failures a common experience for users in these regions.

Character encoding failures are relatively unusual for English speakers. This is mostly because a large portion of character encodings (including, most importantly, Unicode) are derived from ASCII and share the ASCII code points in common---the ASCII code points being pretty much all that's used in American English, and nearly all that's used in British English except for that problematic £. Of course ASCII does not account for certain aspects of English typography such as ligatures and various lengths of dashes, and these are now often viewed as unnecessary flourishes as a result. It's hard to blame any of these problems entirely on computers, though, as the same issues were present (and sometimes more severe) in typewriters.
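A small Python sketch makes the point. The codec names are Python's own; the sample strings and helper names are just illustrations I made up. Plain ASCII text decodes identically under several ASCII-derived encodings, while that problematic £ comes out differently depending on which encoding each side assumes.

```python
ASCII_DERIVED = ("ascii", "utf-8", "latin-1", "cp1252")


def survives_everywhere(text):
    """True if text encodes as ASCII and decodes the same under each codec."""
    raw = text.encode("ascii")
    # Every ASCII code point fits in seven bits.
    assert all(byte < 0x80 for byte in raw)
    return all(raw.decode(codec) == text for codec in ASCII_DERIVED)


def fits_in_ascii(text):
    """True if every character in text has an ASCII code point."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# American English text is safe no matter which codec the reader guesses.
assert survives_everywhere("That will be $5.")

# The pound sign is outside ASCII, and the encodings disagree about its bytes.
assert not fits_in_ascii("£")
utf8_pound = "£".encode("utf-8")      # two bytes: 0xC2 0xA3
latin1_pound = "£".encode("latin-1")  # one byte: 0xA3
assert utf8_pound != latin1_pound

# Reading UTF-8 bytes as Latin-1 produces the classic stray character.
garbled = utf8_pound.decode("latin-1")
assert garbled == "Â£"
```

The same mismatch applied to Japanese text garbles every character rather than just one, which is roughly why English speakers rarely notice the problem.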

There is, in general, a large factor of "first-mover advantage" here. Computer technology was largely developed in the US and UK and so it was largely designed around the needs and sensibilities of English-speaking users. On the other hand, there is also a phenomenon of "first-mover disadvantage," which is exemplified by the European color television standard (PAL) having been generally superior to the US standard (NTSC) due to being developed several years later when better electronics were available. But, then, PAL networks ended up delivering a lot of content that had been (crudely) scaled from NTSC, because of the cultural dominance part[3].

The other non-English-speaking country with significant early computer development was Russia. Because most of this development happened behind the iron curtain and under state (and specifically military) purview it is not always as well documented and studied, especially from the US perspective[4]. By the same token, internationalization of English technology to Russian (and vice versa) was relatively uncommon, and Soviet computer history is essentially its own separate but parallel process.

One of the thorniest areas for internationalization is in the tools themselves. Out of the wide world of programming languages, ALGOL is almost unique in having been intended for internationalization. ALGOL was "released" in multiple languages, with not only the documentation but also the keywords translated. There have been occasional "translations" of programming languages out of English but none have ever been successful on any significant scale. If you are truly interested you can, for example, obtain a compiler for C++ but in Spanish. No one who speaks Spanish actually uses such a thing.

The dominance of English in computer tooling is exemplified by Yukihiro Matsumoto's Ruby programming language, which uses keywords in English rather than Matsumoto's native Japanese, even though it was initially little known outside of Japan. English is thought to be the "lingua franca" of programming, a term which is a bit ironic in that one of my most frustrating personal stories of software was my going in to solve a simple problem in some open source software, only to find that the comments and symbols were entirely in French. Quoi?

[1] There's actually kind of a neat trick where if you lay out the ASCII table in four columns it makes a lot of intuitive sense. This is a lot like saying that if you count the letters in every word of the Bible you will hear the true word of God.

[2] At the time this was referred to as an Independent Software Vendor or ISV. Today, the concept of the software being developed by a different firm than the hardware is so normalized that the term ISV is rarely used and comes off as slightly confusing. Where once Microsoft had stood out for being (mostly) an ISV, now Apple stands out for being (mostly) not an ISV.

[3] The poor quality of early NTSC-to-PAL conversions was one of many things lampooned by British satire series "The Day Today," where the segments from their supposed American partner network featured washed-out colors, a headache-inducing yellow tint, and intermittent distortion. This was indeed a common problem with American content broadcast in Britain, prior to the use of digital video. British content broadcast in America seems to not have suffered as much, probably because the BBC made more common use of the "kinescope" technique in which the television recording was exposed onto film, which was then recorded back into television in the US using NTSC equipment.

[4] This is quite unfortunate because a combination of pursuing alternate paths and wartime/economic challenges led Soviet computer development into some very interesting places. Vacuum tubes were used in the USSR well after their falling out of favor in the USA, which led to both some amazing late-stage vacuum tube designs as well as Russia being the world's leader in vacuum tube technology today.


>>> 2020-06-14 manifest telephone

Some time ago I got into a discussion online which led me, once again, to articulate my belief in the spiritual significance of the telephone. I will try to articulate the point, somewhat more clearly, here.

Lately I have been reading Marc Reisner's "Cadillac Desert," an excellent and important book about the large-scale waste and destruction of the West's water resources. The book has been compared by some to "Silent Spring," which I think simultaneously illustrates that it is a good book on an issue of critical importance, but also shows the sad state that "Silent Spring" more or less triggered an environmental movement while the issues "Cadillac Desert" discusses have seen virtually no progress today.

Well, that's a bit beside the point, but there is something that Reisner talks about in the book that I think is important. From the beginning he explains that the projects to irrigate large areas for farming in the West were always economically undesirable. That is, consistently, the cost of building the irrigation project was much larger than the value of the farming it enabled. Yet, these projects were very politically popular, at most times across both parties---including the fiscal conservatives. So, one wonders, if not money, and if not agricultural production itself (as these projects frequently only enabled production of crops already available in excess), what led to all of these dams and waterworks?

Reisner argues that, in the American West, irrigation is a religious issue rather than a practical one. There is some justification for this right off the bat by observing that the Bureau of Reclamation was established principally by Mormons for whom it was quite literally a religious issue, but that almost misses the point. The important thing is that irrigation projects were pursued because they were righteous, because they were an important component of American ideals, the American ideal being, of course, fertile land, not open desert. The appeal of irrigation as a religious project to civilize the West drove politicians and engineers to pursue these works beyond all reason.

Of course, this sounds rather familiar, doesn't it? Most of us in school learn about a prominent spiritual movement with an impact on the West, and that is Manifest Destiny. In fact, the development of enormous irrigation works in the West like the Hoover Dam is, essentially, an extension of Manifest Destiny, but in the ever more ambitious sense that we ought not just settle the West but change it to fit our Eastern sensibilities.

The effect, I think, is not restricted to irrigation.

By near universal agreement, the concept of Manifest Destiny in textbooks, school lessons, and Wikipedia is illustrated by the painting "American Progress" by John Gast, which depicts droves of settlers headed west by horse, wagon, and train. Prominently, though, in the foreground, the painting features the lady Columbia headed west as well and stringing, behind her, a telegraph line.

When Gast painted "American Progress" in 1872, AT&T (or then, the Bell Telephone Company) had not yet quite been founded. The Long Lines division, with its explicit goal of connecting the nation, would not be established until six years later. Gast was most likely thinking at the time of the railroad telegraphy system and the early telegraph giants like Western Union[1].

One of the many lessons of the 19th century is that it is difficult to operate any national enterprise when it takes weeks to convey messages between offices[2]. The railroads and the financial industry were some of the largest organizations to run into these problems, which is of course why they were early adopters of telegraphy.

It was in this context that AT&T got off the ground. While the divide between telephone and telegraph back then was somewhat larger than it is today (telephones having been enormously expensive early on), there was still a sense that the telephone was solving the same problem as the telegraph, and perhaps better. AT&T was at least a spiritual successor to Western Union, as well as claiming away most of their business.

AT&T really gained steam in the early 20th century. It was 1907 when AT&T essentially announced its intent to become a monopoly---"One System." Manifest Destiny had largely petered out by then, but I would argue that within AT&T, the spirit of "Telephone as Civilization," "Telephone as Progress of Man," and "Telephone as American Ideal" was stronger than ever. In fact, it was AT&T's rapidly acquired monopoly status that facilitated this fervor. Religious values do not especially thrive under capitalism, but AT&T was not subject to capitalism: they weren't just a phone company, they were the phone company, and the regulation that oversaw their monopolized service was just as devout in the religion of the telephone as they were.

This view of the telephone as religion can offer useful insight into the behavior of AT&T up to (and surely to some extent after) the breakup, but perhaps most significantly it is a way of analyzing how AT&T changed after the breakup.

Prior to the breakup, AT&T expanded and improved their network with religious zeal. This dedication to their cause led to the establishment of Bell Laboratories and, ultimately, to the transistor and in many ways to the computer. At the same time, it led to high consumer rates, because rates were determined not competitively but by AT&T's insatiable desire to invest.

Telephone was a religion less in the sense of Jesus Christ and more in the sense of George Washington. In the early 20th century these two were hard to separate from each other, Washington's apotheosis having been illustrated relatively recently. The First World War, and much more so the Second World War, challenged deities in more ways than one, and by 1950 Nietzsche would presumably have declared Gen. Washington to be dead.

AT&T, though, by merit of its unusual position as a protected monopoly, continued through the mid-century with a strong belief in its own god and continued to adulate it with the dial tone. Saul Bass's 1969 pitch reel introducing AT&T's new corporate branding system depicts some of this spirit, along with an example to remind us that the apparent insanity of the "Gravitational Pull of Pepsi" is not a new phenomenon. This video is available on YouTube courtesy of the AT&T Archives and you absolutely must watch it, several times, if nothing else to appreciate the truly period fashion sense espoused in the new uniform designs.

More to this point, though, the video is a brilliant artifact of the trailing end of the period in which Telephone Men were an institution as strong as letter carriers once were, Telephone Women wore uniforms behind the switchboard to be seen by no one but themselves, and Telephone Executives had little Bell logos embroidered on their French cuffs.

Yes, it's a work of corporate branding and so essentially a work of corporate advertising and everything is shown in its best possible manifestation. But there are hints of the kind of care that we don't often see today. Outside plant crews wore a uniform under their coveralls, the coveralls to be removed whenever they entered a customer's premises to avoid bringing in dirt and grease. At least, this was the goal. Today the usually subcontracted telephone technicians set the lofty goal of arriving within a four-hour window and mostly miss it. I once had a long conversation with a technician subcontracted by CenturyLink about how he hoped to buy a self-service car wash and get out of the whole telephone mess. This conversation occurred as he frowned at his instruments and worried to me that the infrastructure was simply in too poor of repair to get VDSL to work on more than one pair. After over an hour of walking back and forth between house and pedestal, interspersed with phone calls related to said car wash acquisition, he declared it impossible to provide me the service I had tried to subscribe to. I rate this as a very positive interaction with CenturyLink's consumer division because he arrived, admittedly at the wrong time, but on the correct date, and at least put on the appearance of exerting real effort before declaring the telephone system hopeless.

This is all very anecdotal, of course, but the real point to examine is that of reliability. The first commandment of the religion of the telephone is "Thou shalt deliver a dial tone always." Much like Reisner's irrigation engineers fervently executed projects which would return cents on the dollar (in the best case), the Bell System's engineers invested their effort in chasing out yet another "nine" in reliability which would be hardly noticed by customers. Electronic telephone switches were built for enormous redundancy. WECo's[3] installation service coordinated armadas of technicians like choreographed dancers to transfer customer lines from an old switch to a new one in a matter of minutes and with only seconds of interruption per customer. In perhaps the crown jewel of the Bell System's dedication to reliability, in 1930 the Indiana Bell building was moved in its entirety to make room for a new larger one---all while in active use, utility cables dragged slowly behind and a wooden walkway, practically airstairs, wheeled along with the building's entrance so that the staff could come in and out of their offices as usual.

In most industries, a service interruption might be scheduled to facilitate cutover to a temporary switch, then again to cut over to a new one. The Bell system routinely managed replacement of switches with zero downtime using strategies that varied from complex (splicing switching devices into in-use telephone wiring to prepare for "all at once" cutover) to whimsical (lining up in rows along the distribution frame, cable loppers in hand, to cut out the old switch in time to a supervisor's whistle).

There was, of course, a fall from grace.

I said that religion does not thrive under capitalism, and of course this was the fate met by the Bell system. The breakup of the Bell system in 1982 occurred primarily in response to their very high rates, which were (accurately) seen as symptomatic of the monopoly they enjoyed. The breakup was successful in reducing rates and was a key step towards the situation we have today in which multiple competitive cellular carriers are (mostly) driving their rates downwards over time.

But, of course, it is clear that AT&T's high rates were not exclusively a result of privileged profiteering. They were also a result of AT&T's enormous R&D budget, their dedication to reliability, and their generous staffing from customer service to engineering. The telephone system was never quite the Garden of Eden but competitive phone service certainly was forbidden fruit.

While costs have decreased tremendously, so have reliability and quality of service. The surviving fragments of the Bell System are now some of the most hated companies by consumers. They're often second only to their later upstart competition, the cable television carriers, which exist in a similar state of sin but, having grown up entirely in such a fallen state, lack even the memory of their former grace to moderate their avarice. I'm not sure what the seven deadly sins of the religion of the telephone are, but I can promise that Comcast is guilty of every one.

There is a great deal of economic analysis which can be done to explain the changes that the telephone system underwent after the breakup of the Bell system. The truth is sufficiently complex that it's hard to say whether or not the whole thing was a good idea. What does seem certain is that it was inevitable; if competition was the forbidden fruit, MCI was the serpent. Or perhaps Carterfone? Maybe Carterfone is the serpent and Sprint and MCI are Cain and Abel. I don't know, the metaphor could use more work.

All of this depicts a rather rosy and simplified view of the whole situation. Of course pre-breakup AT&T was far from pure virtue, and post-breakup there have been meaningful improvements in consumer service. The poor reputation of the telecom industry today has in part to do with market and social forces that probably would have existed regardless, late-stage capitalism and all, and a radically different world in which AT&T had, say, been nationalized and MCI, Sprint, etc. bought out by the new American Telephone and Telegraph Administration, taxpayer dollars at work, would presumably have all of its own downsides. Nothing is so simple. I'm just here to tell a nice story, though, and maybe there's some insight in it.

While the economic and regulatory analysis is important, I think it misses some of what happened: beyond a financial aspect, there is a social aspect to the history of the telephone system, and a good part of that social aspect is the rise and fall of a religion: not God's chosen people, but the Telephone Men. The telecom industry was already giving in to vice by the time of the breakup, but the breakup was the crisis of belief that led to complete atheism, and then, moral relativism. Or at least tariff relativism.

Whatever happened to traditional telephone values? The market is what happened. Well, the market and everything else.

[1] For the late 19th and early 20th century, the telegraph system was conjoined at the hip to the railroads, both being fundamentally involved in finding long-distance rights-of-way and the railroads relying in part on telegraphy for their own business. While railroads sometimes constructed their own telegraph lines they also often contracted this to Western Union. Until 1960, WU had work crews which lived in modified passenger trains to maintain WU equipment on railroad RoW.

[2] A rather vivid demonstration of the slow travel of news prior to the dual revolutions of the railroad and telegraph is California's admission as a state. On Sept. 9 1850, California was admitted to the United States. No one in California knew this fact until Oct. 18, over a month later, when the ship Oregon arrived having carried goods---and incidentally news---all the way around South America. This was a long journey, but the overland trip from East to West was even longer. Incidentally, of personal interest, New Mexico was established as a territory at the same time.

[3] Just as Bell Labs was the research and development arm of the Bell System, Western Electric Company or WECo was the manufacturing arm, which built and serviced designs out of Bell Labs, and did no small amount of R&D on its own. Like Bell Labs, WECo was left fallow after the Bell breakup. What remains is scattered across the telephone industry, especially Avaya, but the core of WECo, along with Bell Labs, is now part of Nokia.
