We've talked a little bit about markup languages. Broadly speaking, and to use a taxonomy which I completely made up by myself, most markup languages in use for data interchange today are either enclosure-style, in which each element is enclosed by start and stop delimiters (e.g. HTML, XML), or key-value style, in which the file consists more or less of a list of keys and values which may be enclosed in various ways to indicate structures like maps and lists (e.g. YAML and JSON). Of course there are many others as well, and I'm speaking only of data interchange here, not more general markup, but the point stands that these two families are mostly what we use today when we need to get structured data from one thing to another.
Just trying to organize things this way brings us to a somewhat complex question: what exactly is a markup language? My carefully constructed (in about thirty seconds while slightly inebriated) taxonomy happens to exclude, for example, Markdown and RST, which would generally be called markup languages. This is partially because I'm just focusing only on the things that are interesting to me in this case, but it's also partially because the concepts of a markup language and of a data interchange format are somewhat loosely defined.
Wikipedia, which is never wrong, says that "a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text." This definition, on a plain reading, clearly includes HTML, Markdown, RST, and many others. Things get a little weird when we look at XML. It has Markup Language right in the name, and it can certainly be used in a fashion similar to HTML (see: the last post), but it often isn't. In cases like XML, and even more so with YAML, the argument that the markup is just an annotation on the text becomes a lot harder to defend. I would be tempted to refer to these as "data interchange formats" rather than "markup languages," but that term is already in use for something different. We could also call them "serialization formats" but people tend to associate that term more with binary formats. So the basic terminology is rather confusing here, and if I had a bit of common sense that's what I'd be trying to taxonomize.
The point of all of this is that I would like to talk a bit about formats which are used for interchanging data between different systems (or occasionally for storing and retrieving data within the same system). These are often called markup languages but are probably not really markup languages, in that they do not focus on annotating (or marking up) text; instead they express data structures which may contain text but are not necessarily text documents. These are "markup?" languages like XML, YAML, JSON (this one doesn't call itself a markup language!), and various others. And specifically, I am talking about the ones that are text-based, as opposed to binary formats like protobuf and others.
As mentioned previously, XML dates back to roughly 1998. YAML came about in 2001, not that much later, but became popular probably more around the mid to late 2000s when it was viewed as the antidote to XML's significant complexity. Most people don't realize that YAML is probably just as complex, because it looks very simple in the minimal examples that most people constrain themselves to.
XML has SGML as an antecedent, and SGML is derived from IBM formats which date back to 1970 or so. Interestingly, this ancient ancestor of XML (called GML, from the days before the "Standard" was prepended) has a certain superficial resemblance to YAML, at least in that it involves significant use of colons. That's a bit interesting, as YAML does not have any clearly described ancestors.
So how did GML work? Well, much like SGML it had start and end tags, but tags were started with a colon and ended with a period, rather than using angle brackets. GML also had a very strong sense of being line-oriented, that is, tags generally went on their own line, which is a bit more similar to YAML than to SGML.
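From what survives of IBM's "starter set" documentation, a GML document looked something along these lines (tag names from memory, so treat the details loosely):

```
:h1.An Example Document
:p.Tags begin with a colon and are terminated by a period,
after which the content follows on the same line or the next.
:ul.
:li.The first list item
:li.The second list item
:eul.
```

Note the explicit end tag for the list (:eul.) but not for paragraphs or list items, a pattern SGML would later generalize.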
In fact, the great bulk of early data interchange formats were line-oriented. There are various reasons for this, chief among them that it is simply intuitive to put "one record per line," as it matches the conventional tabular formats we're familiar with in print. It was also essentially a technical constraint of punched-card-based computer systems, where "line" and "file" (in the modern sense) were more or less equivalent to "card" and "stack" when working with punched cards---that is, each card was considered a line of text. That each card could be called a "record" and a set of records made up a file shows the degree to which electromechanical punched card systems, and the computers derived from them, were intended to model pre-computer business records kept as lines in ledgers.
Overall I have found it extremely difficult to trace any kind of coherent history of these formats, which is probably reflected in how disorganized this message is. Many old data interchange formats have familial resemblances to each other, giving the tantalizing suggestion that a "family tree" could be traced of which were based on which others, but actually doing this would probably require a great deal of original research and I have both a full-time job and hours of standing in the living room staring at the wall to keep up with, so while I have made some tentative forays into the matter I do not expect to publish a treatise on the origins of XML any time soon.
Instead, I would like to mention just a few interesting old data interchange formats and some things we can learn from them. Most of these examples are old, and all of them come from a context in which a body of experts attempted to design a single, unified data model sufficient to meet all the needs of a given problem domain. This has profound implications. I have said before and I will say again that computer science is the discipline principally concerned with assigning numbers to things. In the realm of computer science (and specifically AI, in the original meaning of AI, not the marketing buzzword of today) research, the term "ontology" is borrowed from philosophy to refer to defining the nature of things. That is, ontologists in CS do not seek to establish what is, they seek to represent what is. This is perhaps the highest-level academic discipline of assigning numbers to things and deals with fundamental and theoretical questions about how computer systems can represent and manipulate complex domains of knowledge. While the ontologists of philosophy ponder what does and can exist, the ontologists of computer science ponder how to punch all of that onto paper cards.
XML is not exactly a masterpiece of ontology, but there is a whiff of ontology throughout the world of data interchange formats. Designing a domain-specific interchange format requires considering all of the areas of knowledge in that domain and assigning codes and keywords to them. Designing generalized interchange formats requires considering all of the structures of knowledge that need to be expressed. Because the set of data structures in use by computer systems is in practice highly constrained by both the limits of technology and the limits of the people who use the technology (essentially everything in life is either a map or a list, regardless of what your professors told you about bicycles and inheritance), it seems that in practice creating a generalized markup language is almost the easier of the two efforts. At least JSON is really dead simple. Of course, for generalized languages which support schemas, schemas tend to bring in domain-specific knowledge and all the complexities thereof.
So let's forget about generalized markup languages for now and jump back to a time in which generalized markup languages were not in widespread use and most software systems exchanged data in domain-specific formats. These domain-specific formats were often being developed by domain experts using very careful consideration of everything which may need to be represented. We see in this pursuit both complex theoretical problems in computer science and the ways in which large parts of computer science (generally the more applied assigning of numbers) are derived from information or library science.
That was an extremely long preamble to get to the actual point of this message, but hopefully it provides a bit of context into why I am about to tell you about MARC.
If I am to argue that we can blame large parts of computer science on library science, MARC is my key piece of evidence. Librarians and other information science types are deeply concerned with the topic of "authority control," which is basically about being able to uniquely identify and look up information based on standardized names. A book ought to have one title and one author (or set of authors) which can consistently be used to look it up, even though people are prone to use abbreviations and write names in different ways. A similar problem is seen in genealogy, where the spelling of family names often drifts from generation to generation, but researchers tend to consider "McLeod" and "MacLeod" to be the same name despite the variable spelling. You could argue that when Google corrects your spelling errors it is practicing a form of authority control by standardizing your query to the authorized vocabulary.
Yes, authority control tends to be based around the idea of establishing a restricted vocabulary of standardized, or authorized, names. J. R. R. Tolkien, John Ronald Reuel Tolkien, and my insistence on misspelling it J. R. R. Tolkein ought to all be standardized to the same authorized name, so that a query for any of these representations returns all of his books. "Tolkien, J. R. R." according to the library catalog. This idea of a standardized, constrained vocabulary will be familiar to anyone in computing as it's the same kind of thing we have to think about when dealing with computers. MARC rests at exactly the intersection of the two.
MARC is short for Machine-Readable Cataloging. It was developed for the Library of Congress in the 1960s for the purpose of representing the library catalog in computer form. It is still in fairly common use today as a "lowest common denominator" interchange format between library cataloging software developed by different vendors. While there is an XML variant today, MARC is most widely seen in its original, 1960s format that looks like this:
008 180410b ||||| |||| 00| 0 eng d
020 _c EC$20.00 (cased).
100 _a Tolkien, J.R.R.
245 _a The silmarillion /
_c J.R.R. Tolkien ; edited by Christopher Tolkien.
260 _a London :
_b Book Club Associates,
300 _a 365 p. ;
_c 23 cm.
500 _a Includes index.
650 _a Baggins, Bilbo
650 _a Middle Earth (Imaginary place)
Of course, this is not exactly what it looks like. This is in part because I have omitted certain fields to make it more readable, but it's more so because the standard representation of MARC makes use of non-printable ASCII control characters to separate fields, and not the newline. I have swapped out these control characters for newlines and spaces and then indented to make things more clear. I have also omitted some junk that comes out of the details of the format such as a bunch of extra slashes. The point is that I have made this format look tremendously more human-friendly than it actually is.
MARC consists of fields, each identified by a three-digit number. Fields may have subfields, identified by a letter. For example, field 245 is Title Statement. Subfield A is Title, subfield C is "statement of responsibility, etc." according to the LoC documentation. Not all of these fields make so much sense. Field 008 is called "fixed-length data elements" and is part of the control fields (00x fields). It contains things like the date the book was added to the catalog and where the catalog data came from, but also some less control-ey data like "target audience." But all of this is combined into one field using a fixed-width format, and the pipe is for some reason used as a "fill" character for fields which are required but have no data.
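As a sketch of how little machinery the field/subfield model needs, here's a toy Python parser for the cleaned-up, newline-based rendering above. This is emphatically not a parser for real MARC, which uses the leader, directory, and control characters; it's just the idea of number-tagged fields with letter-tagged subfields.

```python
# Toy parser for the cleaned-up, newline-based rendering of MARC shown
# above. Real MARC (ISO 2709) has a leader, a directory, and ASCII
# control characters as separators; this handles only the readable form.

def parse_simplified_marc(text):
    fields = {}
    tag = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line[:3].isdigit():
            # a new field, e.g. "245 _a The silmarillion /"
            tag = line[:3]
            fields.setdefault(tag, [])
            rest = line[3:].strip()
            if rest:
                fields[tag].append(rest)
        elif tag is not None:
            # a continuation line holding another subfield, e.g. "_c ..."
            fields[tag].append(line)
    return fields

record = parse_simplified_marc(
    "100 _a Tolkien, J.R.R.\n"
    "245 _a The silmarillion /\n"
    "    _c J.R.R. Tolkien ; edited by Christopher Tolkien.\n"
)
print(record["245"])  # both subfields of the Title Statement
```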
This idea of enumerating every field that might need to be expressed and then assigning numerical codes to them is a common aspect of early data interchange formats. I will show one other example before ending this rather long message and leaving more for later. That's a 1980s-vintage format that I have the pleasure of dealing with in my current day job, called Health Level 7 or HL7. HL7 serves as a "lowest common denominator" format for exchange of data between different electronic health record systems. An example HL7 record, courtesy of Wikipedia, follows, but note that I have removed some fields for brevity.
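Here are the MSH (message header) and PID (patient identity) segments:

```
MSH|^~\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20060529090131-0500||ADT^A01^ADT_A01|01052901|P|2.5
PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M||2028-9^^HL70005^RA99113^^XYZ|260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35209^^M~NICKELL'S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O|||||||0105I30001^^^99DEF^AN
```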
If we can stop chuckling at "Nickell's Pickles," we can see that this looks very different from MARC but there is a similar phenomenon going on. Each line is a field with components separated by pipes. The first component is a three-character (but now alphanumeric) field ID. MSH identifies message type, PID is patient identity. Each of these is separated into many subfields, in the case of PID we can make out an ID number, a name, date of birth, etc. Once again, the same basic concept of code-identified fields with various subfields, and once again represented as one field per line. This time, mercifully, the field separator is newline and the subfield separator is pipe. These are conveniently human-readable so I have not had to replace them with whitespace. Finally, we once again have the use of odd filler symbols, mainly ^.
^ needs to be used basically because of a limitation in the data model: there is no way to separate "subsubfields." Consider the address. "260 GOODWIN CREST DRIVE" has a space in it, spaces are quite acceptable. But the EHR in use, like most software, feels the need to separate components of the address into tidy fields. Space can't be used to separate subsubfields because it's used within the subfields. Newline can't be used because it's the field separator. So instead, ^ is used. Further, both ^ and ^^ are used to represent subsubfield separations of different orders. "BIRMINGHAM^AL" is essentially equivalent to "BIRMINGHAM AL" except that the use of ^ rather than space assures the parser that it is the separator between city and state, not a space within the name of the city. Humans are largely smart enough to figure out that there is probably no city called "Birmingham Al" and so the "AL" must be a state, but computers are not.
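Mechanically, this is all just nested splitting. Here's a sketch in Python, using a made-up, abbreviated PID segment of my own (real HL7 also has repetition, subcomponent, and escape separators, all declared in the MSH segment, which this ignores):

```python
# Nested splitting of an HL7-style segment: fields are separated by |,
# components (the "subsubfields" above) by ^. Real HL7 additionally has
# repetition (~), subcomponent (&), and escape characters; ignored here.

segment = "PID|12345|KLEINSAMPLE^BARRY^Q|260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL"

fields = segment.split("|")
name = fields[2].split("^")
address = fields[3].split("^")

print(name)        # ['KLEINSAMPLE', 'BARRY', 'Q']
print(address[1])  # '' -- the ^^ left an empty component behind
print(address[2])  # 'BIRMINGHAM'
```

The empty string in the middle of the address is exactly the ^^ phenomenon: a required component with no data, held open by its separators.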
Alright, I'm going to try to stop talking now. But I want to follow up in a future post by going on at length about fixed-width fields and their long heritage, and also perhaps about the pipe as a field separator, which is something that's very widely seen in early (say pre-1995) formats but rarely seen today. That will bring me to the matter of the comma as a field separator, something that is in fact very common today and has turned out to be a monumental pain. Finally, I'll loop back to those ASCII control characters that MARC used and I removed for you, and wonder why no one uses them today.
As the local news warned us in the early 2000s, the internet is a scary place
full of hidden dangers. One of these is HTML.
Let's begin this discussion of the internet's favorite markup languages with
just a quick bit about XML. XML, or Extensible Markup Language, is a
complicated markup language which is highly popular with enterprise software
and Microsoft. More seriously, XML was introduced in mid-'90s as a highly
standardized markup language which could be used for a wide variety of
different purposes while still being amenable to consistent parsing and
validation. This was achieved by making XML "extensible" in the sense that
multiple schemas and document type definitions (DTDs) can be used and combined
to allow XML to express nearly anything---while still being conformant to a
well-defined, standard schema. But we're not here to talk about XML.
Conforming to a well-defined, standard schema is a lot of work and not very
fun, so naturally XML has fallen out of fashion. First, the community favored
YAML over XML. YAML is a markup language which appears, at first glance, to be
very simple, but as soon as one looks beneath the surface they discover a
horrifying Minotaur's labyrinth of complex behavior and security
vulnerabilities in the making. Partially in response to this problem but mostly
in response to the community losing interest in every development target that
isn't Google Chrome, YAML itself has largely fallen out of favor and been
replaced by JSON, except for all of the places where it hasn't. Also there is
TOML. But we're not here to talk about markup languages either.
We're going to talk about HTML.
Computers Are Bad pop quiz: do you, in your heart of hearts, believe that HTML
is a form of XML?
If you answered yes, you are wrong. But, you are wrong in a very common way,
which seems to be rather influential. That is what we're here to talk about.
It's actually kind of clear on the face of it that HTML is not derived from
XML. The first XML specification was published in 1998; depending on how you
look at it HTML was first in use somewhere between 1990 and 1995. In fact, both
HTML and XML are derived from a now largely-forgotten standard called SGML, or
Standard Generalized Markup Language, which traces its history back several
decades before HTML or XML. HTML and XML have a familial resemblance because
they are siblings, not parent and child.
This has some interesting implications. To really get at them, we need to look
a little bit at SGML. The following is a valid SGML snippet:
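```
<ul>
<li>First item</>
<li>Second item</>
</ul>
```

And the same list in HTML (both of these are toy examples of my own):

```
<ul>
<li>First item
<li>Second item
</ul>
```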
These look very similar but---HOLD ON A MOMENT---the SGML version has some
weird business going on, and in the HTML version the li (list item)
elements are just dangling with nothing on the other side! I am exaggerating as
to the level of shock here, but my impression is that a lot of people with a
moderate to even professional level understanding of HTML would be surprised
that this is valid.
When XML was designed based on SGML, one of the explicit goals was actually to
make the language simpler and easier to parse. This might be a surprise to
anyone who has ever interacted with an XML parser. But the reality is that XML
is easier to parse because it has a much stricter definition that makes XML
documents more consistent from one to the next. One of these strict rules is
that, in XML, all elements must be explicitly closed. This is a new rule
introduced by XML: in SGML, there is not only a compressed syntax to close
an element (</>) but closing elements is often optional. No closing tag is
required at all if the parser can infer that the element must have closed
from the context (specifically when a new element starts which cannot be
nested in the prior).
We do something pretty similar in English. When we tell stories, we typically
omit stating that we stopped doing something because most of the time that
can be inferred from the fact that we started doing something else. This works
well because humans have an especially sophisticated ability to interpret
natural language using our understanding of the world. Computers do not
understand that the world exists, so enforcing very strict rules on the
construction of languages makes it easier for computers to understand them.
This would all be basic knowledge to anyone with a CS degree and/or who has
heard of Noam Chomsky as a linguist rather than as a socialist, but it's still
pretty interesting to think about. As a general rule, the better a language is
for a computer, the worse it is for humans!
So XML made the decision to be annoying to humans (by requiring that you
explicitly state many things that could be inferred) in order to make parsers
simpler. HTML, being derived from SGML instead, requires that parsers be more
sophisticated by allowing authors to elide many details.
Perhaps you can imagine where this goes wrong. In fact, for various reasons
that range from loose specifications to simple lack of care, HTML parsers were
both extremely complex and extremely inconsistent. This reached a peak in the
late 2000s as many webpages either only worked properly in certain web browsers
or had to include significant markup dedicated to making single web browsers
function properly. While there was some degree of blame all around, Microsoft's
Internet Explorer was the main villain both because its developers had a habit
of introducing bizarre non-standard features and because Microsoft is
fundamentally hateable. Because of MSIE's large market share, the de facto
situation was that many webpages functioned properly in MSIE but not in, say,
Netscape Navigator, err, uhh, Firefox, even though Firefox was the browser that
did a better job of adhering to the written standards.
This situation led to a fairly serious backlash in the web community. While
some things of real import happened like an EU antitrust case, more
significantly, it became fashionable to declare in the footer of websites that
they were Standards Compliant. Yes, admit it, we are all guilty here.
But something else rather interesting happened, and that's XHTML. In the late
'90s, work started on a new variant of HTML which would actually be based on
XML, and not on SGML. This had the advantage that XML parsers were simpler,
and so web browser HTML parsers could be simpler, more consistent, and have
better and more consistent handling of errors. At the time, essentially no one
cared, but as the browser wars escalated a more consistent specification for
HTML, which was more amenable to exact parsing and machine validation, started
to look extremely tempting.
Further adding to XHTML's popularity, the same time period was a high point in
interest in the "semantic web." Because XHTML is extensible, arbitrary XML
schemas could be embedded in XHTML documents to semantically express structured
data for machine consumption, along with presentation logic for display to
humans. This is the kind of thing that sounds extremely cool and futuristic and
no one actually cares about. The Semantic Web was much discussed but little
implemented until Google and Facebook started imposing markup standards which
were significantly less elegant but required for good search rankings and/or
native social media traffic, and so many SEO consultants transitioned from
adding paragraphs of invisible text in the footer to adding weird meta tags to
the header in order to look better in the Facebook feed. Now that is the
semantic web.
Most people who learned HTML in the 2005-2015 time period actually learned
XHTML, and may not realize it. That's why, today, they strictly close all
of their elements, including the empty ones.
This whole thing is made sort of funny by the fact that XHTML was rather
short-lived. The release of the HTML5 specification in 2014 largely
addressed all of the shortcomings of the HTML 4.01 specification, and
obsoleted XHTML. Part of this is because HTML5 was the shiny new thing,
part of it is because HTML5 largely integrated the features of XHTML in
a more convenient fashion than XHTML, and part of it is because XML was
very popular with Microsoft who is extremely hateable.
In the end, XHTML is essentially forgotten today, very quickly in internet
terms although surely there are still plenty of websites out there written in
it and not yet updated. Perhaps the bigger influence of XHTML is that all we
Millennials are running around closing all of our elements explicitly, which is
considerably ironic in a world where we strip the whitespace out of our
documents to save bytes but don't seem to remove unnecessary closing tags.
Of course, HTML parsers being what they are, it's guaranteed that there are
parsers in use which will malfunction when presented with these completely
standards-compliant documents! I love parsing.
One of those things that nearly everyone knows about computers is that for some
reason "404" means "file not found." Most people that work with computers
seriously are aware that HTTP uses a set of three-digit numbers to report
status back to the client, and that these codes are categorized by first digit.
For example, the '2xx' codes generally mean 'success' and '200' means 'OK.' The
'4xx' codes mean that there is something wrong with the request, and '404'
means that the requested file could not be found by the server.
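Since the categorization is carried entirely by the first digit, a client can deal sensibly with a code it has never seen before by falling back to the code's class. A sketch:

```python
# HTTP reply classes are determined by the first digit alone, so a
# client can handle an unfamiliar code by dispatching on its class.

CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}

def status_class(code):
    return CLASSES.get(code // 100, "unknown")

print(status_class(200))  # success
print(status_class(404))  # client error
print(status_class(418))  # client error, even for the more obscure codes
```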
Perhaps less widely known is where this whole idea of status codes comes from.
It's not unique to HTTP at all. Another widely used internet protocol, SMTP,
uses a very similar scheme of three-digit codes in which, for example, '250'
means something similar to 'OK' (really just that the requested action
completed normally) and '4xx' codes indicate a transient failure, for example
'452' means that the server is out of storage (as when the recipient's mailbox
has exceeded its quota).
This is obviously very similar to HTTP, down to the rough meaning of the
leading digit.
SMTP was first formally described (by Jon Postel!) in RFC 821, dated 1982.
HTTP was first formally described (by Tim Berners-Lee!) in RFC 1945, dated
1996. Both protocols saw limited internal use prior to being published in RFC
format, but it's clear from the gap in years that SMTP is the older protocol.
In fact, it's kind of fascinating to me to consider that HTTP was published
when I was alive, as it seems so ubiquitous that it must be older than me.
Anyway, FTP was formally described (also by Jon Postel!) in RFC 765 dated 1980,
and in fact FTP uses a set of three-digit numeric status codes that also match
the categories used by HTTP. RFC 765 elaborates somewhat on the concept of the
reply code:
    The number is intended for use by automata to determine what state to
    enter next; the text is intended for the human user.
We must remember that it was 1980, a rather different day in computing, when we
read that a separate numeric representation must be provided "for use by
automata." Indeed, a set of state diagrams is provided in the RFC based on
those codes. It's an extremely "early computer science" way to approach the
problem of designing a protocol. That is to say, it makes perfect logical
sense and is perhaps the best approach, but has been largely abandoned today
because such a state diagram for a "modern" protocol would span kilometers.
The question that interests me is whether or not FTP is the origin of the
concept of three-digit status codes or reply codes, and the rough
categorization of 100 for continuation, 200 for OK, 300 for redirect, 400 for
temporary error, and 500 for permanent error (HTTP uses those last two a little
bit differently, for client-side and server-side error).
RFC 765 was not the first discussion of FTP, which, being a very obvious idea
(what if we could use this newfangled network to move files around!), has a
long history. Numerous earlier RFCs represent different stages in the
development of the FTP protocol. The three-digit error codes seem to first
appear in RFC 354, a revision of the draft standard. Previous revisions of the
draft (and protocol, prior to being TCP-based) use one-byte binary error codes
or do not specify brief numeric error codes.
RFC 354 conveniently states that the FTP error codes are similar to the RJE
protocol. RJE, or Remote Job Entry, is a now forgotten protocol which was
essentially a very early form of RPC (as now done with protocols like XML-RPC
and arguably basically all network APIs). Indeed, RJE, as described in draft
form in RFC 360, includes a very similar set of status codes (including 200
OK), except that it also uses the 0xx series of codes.
Confusingly, RJE incorporates FTP as a component of the protocol, but an
earlier form of FTP based on NCP (not TCP) that uses one-byte status codes.
As suggested by the sequence numbers, RFC 360 is very close in date to the
previously mentioned RFC 354, and explicitly mentions that the same set of
status codes are intended to be applicable to "other protocols besides RJE
(like FTP.)" The wording in these two RFCs would seem to imply that the idea
originated with RJE and was then also applied to FTP; the two both had authors
at MIT who were presumably sharing notes, and there is logical overlap
between the two protocols including RJE essentially having an FTP "mode,"
which makes them difficult to completely separate.
This RJE protocol, as ultimately formally described in RFC 407 after revisions,
was actually somewhat sparsely used. RJE protocols in general were mostly used
with mainframe and time-sharing systems, which mostly predated ARPANET, and so
already had their own various RJE protocols implemented by the vendor or the
user (these were back in the days when owners of time sharing systems sometimes
wrote their own operating systems to get a few features they wanted). This
makes it pretty difficult to trace the history of RFC 407 in much detail, not
least because the term "RJE" refers collectively to at least a dozen different
such published protocols.
I was able to track down contact information for one of the authors of RFC 407,
Richard Guida. Unfortunately he didn't recall how the reply code numbers came
about, but I'm not especially surprised. Of course this was quite a long time
ago, but the reply codes also seem like a relatively obvious idea that
probably didn't strike anyone as particularly noteworthy at the time.
Notably, there is some precedent. The pre-TCP (NCP) version of FTP, which
predates RFC 407 RJE, uses a one-byte reply code in a fairly similar way to RJE
and TCP FTP. Speculatively, it seems likely that one of the authors of RJE (or
possibly TCP FTP which seems to have been written out more or less in parallel)
was familiar with the previous NCP FTP protocol and decided that replacing the
one-byte reply code with a three-digit ASCII reply code would both be more
human-readable (useful in a time when debugging protocol implementations by
interacting with them "manually" was probably more common) and would allow for
hierarchical organization by digit.
In fact, the hierarchy was somewhat more specific then. Both the RJE and TCP
FTP specifications refer to the reply codes as being organized into three
levels by hundreds, tens, and ones. HTTP makes no mention of such a three-
level hierarchy, only the two levels of hundreds and ones. While Tim
Berners-Lee was clearly inspired by the RJE/FTP reply codes, he did not
duplicate their structure as faithfully as SMTP.
In summary, the three-digit HTTP status codes date back to at least 1972, and
were already about a quarter century old when they (or at least a similar set)
were used for HTTP. We are now coming up on 50 years since 200 "OK" was first
defined, and it does not seem likely that it will go away any time soon.
One might question the utility of having these numeric reply codes when there
are also text explanations sent along with them. The original intent seems to
have primarily been that the numeric codes were easier to parse and use in
software. That said, going all the way back, protocols which use these codes have
stated that the text representation is not bound to a specific string. This
means that a 404 error is a 404 error regardless of whether or not the
accompanying text error is 'File Not Found,' which could allow for
internationalization or just unusual server configurations.
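In practice, that means a client should dispatch only on the number and treat the reason phrase as decoration. A sketch of parsing a status line this way:

```python
# Parse an HTTP-style status line. Only the numeric code is
# authoritative; the reason phrase is free text and may vary or be
# localized without changing the meaning of the reply.

def parse_status_line(line):
    parts = line.split(" ", 2)
    version = parts[0]
    code = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""
    return version, code, reason

print(parse_status_line("HTTP/1.1 404 Not Found")[1])        # 404
print(parse_status_line("HTTP/1.1 404 Datei nicht da")[1])   # still 404
```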
Of course, in the world of HTTP, these errors are almost always represented to
the end user in the form of a dedicated page designed to express the error. As
a result, the actual HTTP status code and conventional error string "File Not
Found" are basically irrelevant. That said, both browsers and servers have long
had default representations of these errors which included the literal phrase
"404 File Not Found," and this has pushed the status code and error string into
the cultural lexicon firmly enough that they remain in common use on custom-
designed error pages that could say whatever they want.
In the end, a fairly minor detail of a network protocol could end up
influencing the popular culture fifty years later. Kind of makes you nervous
about your API designs today, doesn't it?
There is an interesting little chapter of computer history involving ASCII and
Japan. ASCII is, of course, the American Standard Code for Information
Interchange. I
often say that computer science is an academic discipline principally concerned
with assigning numbers to things. Of the many things which need numbers
assigned to them, the letters of the alphabet are perhaps one of the most
common. ASCII is a formal standard, derived from several informal ones, for
allocating numbers to all of the characters which were deemed by the computer
industry to be important in 1963. It likely requires no explanation that ASCII
accounts only for the English language and American currency.
ASCII itself is not especially interesting, besides to note that it is in fact
a seven bit code, which leads to the important "computers are bad" theme of
what it means for a system to be "eight bit clean" and why some systems are
not. That is a topic for a later day, though. Today I will constrain myself to
ASCII and Japan.
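The seven-bit nature of ASCII is easy to check mechanically: every ASCII code
point fits in the range 0-127, leaving the high bit of each byte unused (and
available for systems to abuse, hence the "eight bit clean" question). A quick
sketch in Python:

```python
def is_seven_bit_clean(data: bytes) -> bool:
    # ASCII uses only code points 0-127, so the high (eighth) bit of
    # every byte in a pure-ASCII stream is zero.
    return all(b < 0x80 for b in data)

print(is_seven_bit_clean(b"plain ASCII text"))         # True
print(is_seven_bit_clean("£10".encode("utf-8")))       # False: £ sets the high bit
```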
Japan, of course, principally uses a language which cannot be represented by
the 128 code points of ASCII, most of which are English characters and
punctuation and the rest of which are control characters no one can be bothered
to remember. At the same time, Japan was the first adopter of computer
technology in East Asia and, by many metrics, one of the first adopters of
computer technology outside of the United States. Considering that nearly all
early computers either used ASCII or an even smaller character set, this raises
an inherent problem, which was largely resolved by the introduction of various
Japan-specific character sets (often called "code pages" by earlier computer
systems), which eventually mostly consolidated into Shift JIS.
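That consolidation is still visible in modern language runtimes, which ship
the legacy codec alongside Unicode. A small sketch in Python (codec names as
in the standard library):

```python
text = "日本語"  # "Japanese language"

# Shift JIS can represent Japanese text where ASCII cannot.
sjis = text.encode("shift_jis")
assert sjis.decode("shift_jis") == text  # round-trips cleanly

try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("ASCII cannot encode Japanese text")
```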
And yet, in Japan, ASCII was for a time a very big deal. I am talking, of
course, not of the US cultural dominance that forced Japanese industry to at
least partially use Roman characters due to the limitations of technology
designed in America, but rather of the ASCII Corporation.
The ASCII Corporation published ASCII Magazine, which was the preeminent
computer technology magazine of Japan. Being published in Japanese, ASCII
Magazine was, of course, not representable in ASCII. Most interestingly, ASCII
Corporation was, for over a decade, the Asian sales division for Microsoft.
Microsoft and ASCII collaborated to design an open standard for personal
computers called MSX, which was on the market at the same time as the IBM PC
and ultimately failed to gain more traction than PC clones. That said,
Microsoft's experience with MSX, along with the PC, was no doubt one of the
motivators in Microsoft's broader philosophy of decoupling the hardware vendor
and software vendor.
This is all somewhat aside from the curiosity of the name ASCII. I have found
limited historical information on ASCII Magazine. In part this is because the
original material is in Japanese, but I have noticed a more general trend of
historians of computing being oddly uninterested in the popular
publications. The kind of excessively concise summary usually given of ASCII
Magazine's history is typical of the US computer hobby magazines as well.
What is fairly well documented is that the key founder of ASCII Magazine and
the ASCII Corporation had recently visited industry events in the US, and of
course Japanese computer hobbyists would have been well exposed to ASCII due to
the common use of imported American and British computers. It seems likely that
the founders simply chose a "computer-ey" term that sounded cool, nearly all
such terms being of course divorced from their original meanings when borrowed.
The introduction of computer technology into foreign markets is the kind of
topic that you could write many books about. The case of Japan is interesting
for being perhaps the first major market for American and British computer
companies which used characters other than the Roman alphabet, essentially
introducing the problem of internationalization which we know and love today.
Some time later, Arabic led to a second round of the effect as software had to
be made to account for right-to-left layout. Both of these are still very much
real problems today, with character encoding confusion and RTL layout failures
a common experience for users in these regions.
Character encoding failures are relatively unusual for English speakers. This
is mostly because a large portion of character encodings (including, most
importantly, Unicode) are derived from ASCII and share the ASCII code points in
common---the ASCII code points being pretty much all that's used in American
English, and nearly all that's used in British English except for that
problematic £. Of course ASCII does not account for certain aspects of English
typography such as ligatures and various lengths of dashes, and these are now
often viewed as unnecessary flourishes as a result. It's hard to blame any of
these problems entirely on computers, though, as the same issues were present
(and sometimes more severe) in typewriters.
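The ASCII-compatibility described above is concrete: an all-ASCII string
produces byte-for-byte identical output in the common ASCII-derived encodings,
and the agreement breaks down exactly at characters like that problematic £. A
sketch in Python:

```python
s = "plain English text"
# ASCII-derived encodings agree byte-for-byte on the ASCII range...
assert s.encode("ascii") == s.encode("utf-8") == s.encode("latin-1")

# ...but diverge as soon as a non-ASCII character like £ appears.
pound = "£"  # U+00A3
print(pound.encode("utf-8"))    # b'\xc2\xa3' (two bytes)
print(pound.encode("latin-1"))  # b'\xa3' (one byte)
# pound.encode("ascii") would raise UnicodeEncodeError
```

This is why English speakers so rarely see mojibake: a stream that never
leaves the ASCII range decodes identically under nearly every encoding in use.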
There is, in general, a large factor of "first-mover advantage" here. Computer
technology was largely developed in the US and UK and so it was largely
designed around the needs and sensibilities of English-speaking users. On the
other hand, there is also a phenomenon of "first-mover disadvantage," which is
exemplified by the European cable television standard (PAL) having been
generally superior to the US standard (NTSC) due to being developed several
years later when better electronics were available. But, then, PAL networks
ended up delivering a lot of content that had been (crudely) scaled from NTSC,
because of the cultural dominance part.
The other non-English-speaking country with significant early computer
development was Russia. Because most of this development happened behind the
iron curtain and under state (and specifically military) purview it is not
always as well documented and studied, especially from the US perspective.
By the same token, internationalization of English technology to Russian (and
vice versa) was relatively uncommon, and Soviet computer history is essentially
its own separate but parallel process.
One of the thorniest areas for internationalization is in the tools themselves.
Out of the wide world of programming languages, ALGOL is almost unique in
having been intended for internationalization. ALGOL was "released" in multiple
languages, with not only the documentation but also the keywords translated.
There have been occasional "translations" of programming languages out of
English but none have ever been successful on any significant scale. If you are
truly interested you can, for example, obtain a compiler for C++ but in
Spanish. No one who speaks Spanish actually uses such a thing.
The dominance of English in computer tooling is exemplified by Yukihiro
Matsumoto's Ruby programming language, which uses keywords in English rather
than Matsumoto's native Japanese, even though it was initially little known
outside of Japan. English is said to be the "lingua franca" of programming, a
term which is a bit ironic given one of my most frustrating personal
experiences with software: going in to solve a simple problem in some open
source project, only to find that the comments and symbols were entirely in
 There's actually kind of a neat trick where if you lay out the ASCII table
in four columns it makes a lot of intuitive sense. This is a lot like saying
that if you count the letters in every word of the Bible you will hear the true
word of God.
 At the time this was referred to as an Independent Software Vendor or ISV.
Today, the concept of the software being developed by a different firm than the
hardware is so normalized that the term ISV is rarely used and comes off as
slightly confusing. Where once Microsoft had stood out for being (mostly) an
ISV, now Apple stands out for being (mostly) not an ISV.
 The poor quality of early NTSC-to-PAL conversions was one of many things
lampooned by British satire series "The Day Today," where the segments from
their supposed American partner network featured washed-out colors, a
headache-inducing yellow tint, and intermittent distortion. This was indeed a
common problem with American content broadcast in Britain, prior to the use of
digital video. British content broadcast in America seems to not have suffered
as much, probably because the BBC made more common use of the "kinescope"
technique in which the television recording was exposed onto film, which was
then recorded back into television in the US using NTSC equipment.
 This is quite unfortunate because a combination of pursuing alternate paths
and wartime/economic challenges led Soviet computer development into some very
interesting places. Vacuum tubes were used in the USSR well after they fell
out of favor in the USA, which led to both some amazing late-stage vacuum tube
designs as well as Russia being the world's leader in vacuum tube technology.
Some time ago I got into a discussion online which led me, once again, to
articulate my belief in the spiritual significance of the telephone. I will try
to articulate the point, somewhat more clearly, here.
Lately I have been reading Marc Reisner's "Cadillac Desert," an excellent and
important book about the large-scale waste and destruction of the West's water
resources. The book has been compared by some to "Silent Spring," which I think
simultaneously illustrates that it is a good book on an issue of critical
importance, but also shows the sad state that "Silent Spring" more or less
triggered an environmental movement while the issues "Cadillac Desert"
discusses have seen virtually no progress today.
Well, that's a bit besides the point, but there is something that Reisner talks
about in the book that I think is important. From the beginning he explains
that the projects to irrigate large areas for farming in the West were always
economically undesirable. That is, consistently, the cost of building the
irrigation project was much larger than the value of the farming it enabled.
Yet, these projects were very politically popular, at most times across both
parties---including the fiscal conservatives. So, one wonders, if not money,
and if not agricultural production itself (as these projects frequently only
enabled production of crops already available in excess), what led to all of
these dams and waterworks?
Reisner argues that, in the American West, irrigation is a religious issue
rather than a practical one. There is some justification for this right off
the bat by observing that the Bureau of Reclamation was established principally
by Mormons for whom it was quite literally a religious issue, but that almost
misses the point. The important thing is that irrigation projects were pursued
because they were righteous, because they were an important component of
American ideals, the American ideal being, of course, fertile land, not open
desert. The appeal of irrigation as a religious project to civilize the West
drove politicians and engineers to pursue these works beyond all reason.
Of course, this sounds rather familiar, doesn't it? Most of us in school learn
about a prominent spiritual movement with an impact on the West, and that is
Manifest Destiny. In fact, the development of enormous irrigation works in the
West like the Hoover Dam is, essentially, an extension of Manifest Destiny, but
in the ever more ambitious sense that we ought not just settle the West but
change it to fit our Eastern sensibilities.
The effect, I think, is not restricted to irrigation.
By near universal agreement, the concept of Manifest Destiny in textbooks,
school lessons, and Wikipedia is illustrated by the painting "American
Progress" by John Gast, which depicts droves of settlers headed west by horse,
wagon, and train. Prominently, though, in the foreground, the painting features
the lady Columbia headed west as well, stringing, behind her, a telegraph wire.
When Gast painted "American Progress" in 1872, AT&T (or then, the Bell
Telephone Company) had not yet quite been founded. The Long Lines division,
with its explicit goal of connecting the nation, would not be established until
six years later. Gast was most likely thinking at the time of the railroad
telegraphy system and the early telegraph giants like Western Union.
One of the many lessons of the early 20th century is that it is difficult to
operate any national enterprise when it takes weeks to convey messages between
offices. The railroads and the financial industry were some of the largest
organizations to run into these problems, which is of course why they were
early adopters of telegraphy.
It was in this context that AT&T got off the ground. While the divide between
telephone and telegraph back then was somewhat larger than it is today
(telephones having been enormously expensive early on), there was still a sense
that the telephone was solving the same problem as the telegraph, and perhaps
better. AT&T was at least a spiritual successor to Western Union, as well as
claiming most of their business for itself.
AT&T really gained steam in the early 20th century. It was 1907 when AT&T
essentially announced its intent to become a monopoly---"One System." Manifest
Destiny had largely petered out by then, but I would argue that within AT&T,
the spirit of "Telephone as Civilization," "Telephone as Progress of Man," and
"Telephone as American Ideal" was stronger than ever. In fact, it was AT&T's
rapidly acquired monopoly status that facilitated this fervor. Religious values
do not especially thrive under capitalism, but AT&T was not subject to
capitalism: they weren't just a phone company, they were the phone company,
and the regulators who oversaw their monopolized service were just as devout
in the religion of the telephone as they were.
This view of the telephone as religion can shed useful insight into the
behavior of AT&T up to (and surely to some extent after) the breakup, but
perhaps most significantly it is a way of analyzing how AT&T changed after the
breakup.
Prior to the breakup, AT&T expanded and improved their network with religious
zeal. This dedication to their cause led to the establishment of Bell
Laboratories and, ultimately, to the transistor and in many ways to the
computer. At the same time, it led to high consumer rates, because rates were
determined not competitively but by AT&T's insatiable desire to invest.
Telephone was a religion less in the sense of Jesus Christ and more in the
sense of George Washington. In the early 20th century these two were hard to
separate from each other, Washington's apotheosis having been illustrated
relatively recently. The First World War, and much more so the Second World
War, challenged deities in more ways than one, and by 1950 Nietzsche would
presumably have declared Gen. Washington to be dead.
AT&T, though, by merit of its unusual position as a protected monopoly,
continued through the mid-century with a strong belief in its own god and
continued to adulate it with the dial tone. Saul Bass's 1969 pitch reel
introducing AT&T's new corporate branding system depicts some of this spirit,
along with an example to remind us that the apparent insanity of the
"Gravitational Pull of Pepsi" is not a new phenomenon. This video is
available on YouTube courtesy of the AT&T Archives and you absolutely must
watch it, several times, if nothing else to appreciate the truly period
fashion sense espoused in the new uniform designs.
More to this point, though, the video is a brilliant artifact of the trailing
end of the period in which Telephone Men were an institution as strong as
letter carriers once were, Telephone Women wore uniforms behind the switchboard
to be seen by no one but themselves, and Telephone Executives had little Bell
logos embroidered on their french cuffs.
Yes, it's a work of corporate branding and so essentially a work of corporate
advertising and everything is shown in its best possible manifestation. But
there are hints of the kind of care that we don't often see today. Outside
plant crews wore a uniform under their coveralls, the coveralls to be removed
whenever they entered a customer's premises to avoid bringing in dirt and
grease. At least, this was the goal. Today the usually subcontracted telephone
technicians set the lofty goal of arriving within a four-hour window and mostly
miss it. I once had a long conversation with a technician subcontracted by
CenturyLink about how he hoped to buy a self-service car wash and get out of
the whole telephone mess. This conversation occurred as he frowned at his
instruments and worried to me that the infrastructure was simply in too poor of
repair to get VDSL to work on more than one pair. After over an hour of walking
back and forth between house and pedestal, interspersed with phone calls
related to said car wash acquisition, he declared it impossible to provide me
the service I had tried to subscribe to. I rate this as a very positive
interaction with CenturyLink's consumer division because he arrived, admittedly
at the wrong time, but on the correct date, and at least put on the appearance
of exerting real effort before declaring the telephone system hopeless.
This is all very anecdotal, of course, but the real point to examine is that of
reliability. The first commandment of the religion of the telephone is "Thou
shalt deliver a dial tone always." Much like Reisner's irrigation engineers
fervently executed projects which would return cents on the dollar (in the best
case), the Bell System's engineers invested their effort in chasing out yet
another "nine" in reliability which would be hardly noticed by customers.
Electronic telephone switches were built for enormous redundancy. WECo's
installation service coordinated armadas of technicians like choreographed
dancers to transfer customer lines from an old switch to a new one in a matter
of minutes and with only seconds of interruption per customer. In perhaps the
crown jewel of the Bell System's dedication to reliability, in 1930 the
Indiana Bell building was moved in its entirety to make room for a new larger
one---all while in active use, utility cables dragged slowly behind and a
wooden walkway, practically airstairs, wheeled along with the building's
entrance so that the staff could come in and out of their offices as usual.
In most industries, a service interruption might be scheduled to facilitate
cutover to a temporary switch, then again to cut over to a new one. The Bell
system routinely managed replacement of switches with zero downtime using
strategies that varied from complex (splicing switching devices into in-use
telephone wiring to prepare for "all at once" cutover) to whimsical (lining
up in rows along the distribution frame, cable loppers in hand, to cut out
the old switch in time to a supervisor's whistle).
There was, of course, a fall from grace.
I said that religion does not thrive under capitalism, and of course this was
the fate met by the Bell system. The breakup of the Bell system in 1982
occurred primarily in response to their very high rates, which were
(accurately) seen as symptomatic of the monopoly they enjoyed. The breakup
was successful in reducing rates and was a key step towards the situation
we have today in which multiple competitive cellular carriers are (mostly)
driving their rates downwards over time.
But, of course, it is clear that AT&T's high rates were not exclusively a
result of privileged profiteering. They were also a result of AT&T's enormous
R&D budget, their dedication to reliability, and their generous staffing from
customer service to engineering. The telephone system was never quite the
Garden of Eden but competitive phone service certainly was forbidden fruit.
While costs have decreased tremendously, so have reliability and quality of
service. The surviving fragments of the Bell System are now some of the most
hated companies by consumers. They're often second only to their later upstart
competition, the cable television carriers, which exist in a similar state of
sin but, having grown up entirely in such a fallen state, lack even the memory
of their former grace to moderate their avarice. I'm not sure what the seven
deadly sins of the religion of the telephone are, but I can promise that
Comcast is guilty of every one.
There is a great deal of economic analysis which can be done to explain the
changes that the telephone system underwent after the breakup of the Bell
system. The truth is sufficiently complex that it's hard to say whether or
not the whole thing was a good idea. What does seem certain is that it was
inevitable; if competition was the forbidden fruit, MCI was the serpent. Or
perhaps Carterfone? Maybe Carterfone is the serpent and Sprint and MCI are
Cain and Abel. I don't know, the metaphor could use more work.
All of this depicts a rather rosy and simplified view of the whole situation.
Of course pre-breakup AT&T was far from pure virtue, and post-breakup there
have been meaningful improvements in consumer service. The poor reputation of
the telecom industry today has in part to do with market and social forces that
probably would have existed regardless, late-stage capitalism and all, and a
radically different world in which AT&T had, say, been nationalized and MCI,
Sprint, etc. bought out by the new American Telephone and Telegraph
Administration, taxpayer dollars at work, would presumably have all of its own
downsides. Nothing is so simple. I'm just here to tell a nice story, though,
and maybe there's some insight in it.
While the economic and regulatory analysis is important, I think it misses some
of what happened: beyond a financial aspect, there is a social aspect to the
history of the telephone system, and a good part of that social aspect is the
rise and fall of a religion: not God's chosen people, but the Telephone Men.
The telecom industry was already giving in to vice by the time of the breakup,
but the breakup was the crisis of belief that led to complete atheism, and
then, moral relativism. Or at least tariff relativism.
Whatever happened to traditional telephone values? The market is what happened.
Well, the market and everything else.
 Through the late 19th and early 20th centuries, the telegraph system was
joined at the hip to the railroads, both being fundamentally involved in
finding long-distance rights-of-way and the railroads relying in part on
telegraphy for their own business. While railroads sometimes constructed their
own telegraph lines they also often contracted this to Western Union. Until
1960, WU had work crews which lived in modified passenger trains to maintain WU
equipment on railroad RoW.
 A rather vivid demonstration of the slow travel of news prior to the dual
revolutions of the railroad and telegraph is California's admission as a state.
On Sept. 9 1850, California was admitted to the United States. No one in
California knew this fact until Oct. 18, over a month later, when the ship
Oregon arrived having carried goods---and incidentally news---all the way
around South America. This was a long journey, but the overland trip from East
to West was even longer. Incidentally, of personal interest, New Mexico was
established as a territory at the same time.
 Like Bell Labs was the research and development arm of the Bell System,
Western Electric Company or WECo was the manufacturing arm, which built and
serviced designs out of Bell Labs, and did no small amount of R&D on its own.
Like Bell Labs, WECo was left fallow after the Bell breakup. What remains is
scattered across the telephone industry, especially Avaya, but the core of
WECo, along with Bell Labs, is now part of Nokia.