 _____                   _                  _____            _____       _ 
|     |___ _____ ___ _ _| |_ ___ ___ ___   |  _  |___ ___   | __  |___ _| |
|   --| . |     | . | | |  _| -_|  _|_ -|  |     |  _| -_|  | __ -| .'| . |
|_____|___|_|_|_|  _|___|_| |___|_| |___|  |__|__|_| |___|  |_____|__,|___|
a newsletter by |_| j. b. crawford               home archive subscribe rss

>>> 2020-07-15 some more formats (PDF)

Let's talk about some more formats. Last time I basically left myself an agenda for the next message, so I'll do my best to adhere to it for once.

Fixed Width Fields

Fixed-width fields are a common feature of older data interchange formats. For the unfamiliar, the idea of a fixed-width field is simple: if you have, say, three fields for each record, just say that the first one is 10 characters long, the second 10 characters, and the third 10 characters. Now you just pad or truncate each value to fit. The main advantage of fixed-width fields is that they make parsing very simple: the parser just grabs the next so many characters to get each field. The downside is that they waste space when values are shorter (due to padding characters) and lose data when values are longer (due to truncation). As a result, fixed-width fields are generally only suitable when you have minimal variable-length data. For example, fixed-width formats can be a good fit for accounting applications, where you have a strong sense of how many digits will be in the numbers you deal with and can accept needing some kind of special-case handling when a number somehow turns out to be longer.
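To make that concrete, here's a minimal sketch in Python (the field names and widths are of course invented for illustration):

    # Hypothetical three-field record layout: 10 characters per field.
    FIELDS = [("name", 10), ("city", 10), ("balance", 10)]

    def write_record(values):
        # Pad short values with spaces and truncate long ones: the two
        # failure modes of fixed-width fields, one per slice.
        return "".join(v.ljust(w)[:w] for v, (_, w) in zip(values, FIELDS))

    def parse_record(line):
        # The parser just grabs the next so many characters for each field.
        fields, offset = {}, 0
        for name, width in FIELDS:
            fields[name] = line[offset:offset + width].strip()
            offset += width
        return fields

    record = write_record(["crawford", "albuquerque", "0000012345"])
    print(repr(record))   # 'crawford  albuquerqu0000012345' -- note the truncation
    print(parse_record(record))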

As you can imagine, in most cases fixed-width fields turn out to be too much of a hazard (in terms of technical debt) for practical use. If you tilt your head just right, the whole Y2K fiasco was basically the result of choosing fixed-width fields that were too short for future use-cases. Sure, the year field maybe always should have been four characters, but in 1975 two characters seemed like plenty to meet the need. Just like how eight characters ought to be enough for Unix usernames, and for filenames (plus a three-character extension). All of these arbitrary limits were great and fine.

And yet, fixed-width formats were quite common in earlier computer systems and still pop up today, mostly in relation to legacy systems and formats. Let's think about why.

The first reason is the punched card. Punched cards have varied in length historically, but when you say "punched card" what most people think of is 80 columns wide[1]. The 80-column card dates back to 1928(!), but is widely known today as the "FORTRAN Statement Card" because FORTRAN became its most popular application, and so most of these cards seen today literally say "FORTRAN Statement" on them regardless of what they were actually used for. Because FORTRAN was designed for these cards, earlier versions of FORTRAN (up through FORTRAN 77) imposed the restrictions of punched cards even when reading source from text files. This includes a limit of 80 characters for each line and special meanings for the first several columns---such that a FORTRAN statement always began in the 7th column. Fortran 90 relaxed these restrictions and allowed for modern use of indentation.

Because punched cards were a fixed width, there were already specific limits imposed on the length of fields, and so it made sense to divide them in a fixed-width fashion. In fact, the reason for doing so is less logical and more physical: punched cards (including the 80-column variant) were first introduced for use with purely electromechanical machines, which had to be designed or configured (by jumper wiring) to understand that certain columns belonged to certain fields. These mappings could not easily be changed, ruling out variable-length fields.

Fixed-width fields were widely used throughout computing of the era but were particularly important in COBOL. One of the features of COBOL was its built-in data model. COBOL essentially had a concept of data structures (somewhat like C's structs but more sophisticated) which were natively serializable to cards, tape, or files. They were natively serializable because they were already stored in memory in a simple linear format using... fixed-width fields. When describing a record format, a COBOL user had to provide the length of each field in characters, including numeric fields---which made plenty of sense because numbers were almost always represented in BCD at the time, so the number of characters and the numeric precision were the same thing.

So, in essence, a COBOL record was a string of characters, and the record format indicated which character offsets corresponded to which fields. Records were both manipulated in memory and written to cards, tape, and disks this way. Fixed-width fields remain especially prominent in industries with significant historic use of COBOL, such as finance, where for example the automated clearing house (ACH) system is based on fixed-width-field text files moved around by SFTP. The use of fixed-width fields in banking computer systems is also the basic reason why the charge descriptions on your credit card statement are INSCRUTABLE ALLCAPS ABBVTD S. In addition to its use of fixed-width fields, COBOL was frequently used on systems which supported only uppercase characters (either as a limitation of the computer's code page or as a limitation of the terminals)[2], and all-caps has been remarkably long-lived as a Thing Computers Do In Bureaucracy.

Fixed-width fields are rarely used in "modern" text-based interchange formats because of their poor ergonomics and the obvious problem of determining the correct field length. That said, fixed-width fields are of course in widespread use in non-text formats including most types of binary serialization. Considering that, for many purposes, your computer never handles numbers of any length other than four bytes anyway, it makes sense to use a fixed four bytes for them.

Field Separators

More logical to us today than fixed-width formats are formats in which fields (on each line) are separated by some type of delimiter. The idea of reserving some character that is not likely to appear in the actual data to serve as a delimiter is one with a long history. As a notable example, in his paper on the Entscheidungsproblem, Alan Turing used the schwa as a marker on the hypothetical machine's tape. Besides it being a reasonably obvious idea, Turing was likely aware of pre-computer precedents as well, such as telegraph operators using a distinctive symbol to mark the end of each message. Turing actually referred to these characters as "sentinels," but today "delimiter" is the norm.

It might seem that an obvious criterion for delimiters is this: the delimiter should not normally appear inside of the field. If a field contains the delimiter which will be used to mark the end of it, it will be necessary to somehow mark the delimiter character as "but not really." Today we refer to this as "escaping" the delimiter character, although the term is somewhat confusing in this case. "Escape codes" were originally sequences that literally began with the escape character, but the term was later expanded to describe any sequence of characters which starts with a certain special character and encodes a meaning as a single unit. To make this concrete with an example: in many modern programming languages we use single quotes (') to delimit the start and end of literal strings. A literal string may sometimes contain a single quote, so we have to "escape" that single quote, except that instead of the escape character we use the backslash. \' is a special sequence, identified by starting with a backslash, that means "this encodes a ' but is not a delimiter." I like to avoid the term "escaping" in reference to a delimiter because this use of escape sequences is actually a special case of a much more general concept, and so it's slightly confusing to learners (although very common) to use the terms "escape sequence," "escape character," "escape code," "escaping," etc. to refer to all of these things that are not obviously related.
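To illustrate with a sketch (this shows the general idea, not any particular language's exact rules), a parser for single-quoted literals has to treat \' as data and a bare ' as the delimiter:

    def parse_quoted(text):
        # Scan a single-quoted literal, treating \' as an escaped quote
        # rather than as the closing delimiter.
        assert text[0] == "'"
        out, i = [], 1
        while i < len(text):
            ch = text[i]
            if ch == "\\" and i + 1 < len(text):
                out.append(text[i + 1])     # escape sequence: encodes one literal character
                i += 2
            elif ch == "'":
                return "".join(out), i + 1  # bare quote: the closing delimiter
            else:
                out.append(ch)
                i += 1
        raise ValueError("unterminated string")

    print(parse_quoted(r"'it\'s delimited' and more"))  # ("it's delimited", 17)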

You can see that this whole thing about escaping is kind of a hassle, so ideally we want to eliminate it, or at least minimize it. That means selecting a delimiter that never or rarely occurs in the data. ASCII provided a convenient mechanism for this: the first 32 ASCII characters are "control codes," and as many as a dozen of these (depending on definitions) are dedicated to marking the start and end of things. So this appears to be an open and shut problem: we need special characters to delimit things, and there they are. But, in practice, these control characters are very rarely used. There are a number of reasons for this, but the most obvious and realistically most significant is simply poor ergonomics. There is no button on the keyboard for ASCII 0x1f "Unit Separator," and no one [who speaks English] likes to use characters that aren't on their keyboard. Further, there is no well-accepted convention for displaying these characters. They basically rule out any sane hand-editing of data.
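For what it's worth, actually using them is trivial; here's a sketch with the ASCII record separator (0x1e) and unit separator (0x1f), which need no escaping precisely because they never show up in normal text:

    US = "\x1f"   # ASCII Unit Separator: delimits fields
    RS = "\x1e"   # ASCII Record Separator: delimits records

    records = [["widget", "4", "19.99"], ["gadget, large", "2", "7.50"]]

    # No escaping or quoting needed, embedded commas and all...
    encoded = RS.join(US.join(fields) for fields in records)
    print(repr(encoded))
    # 'widget\x1f4\x1f19.99\x1egadget, large\x1f2\x1f7.50'

    # ...but good luck hand-editing that in a text editor.
    decoded = [record.split(US) for record in encoded.split(RS)]
    print(decoded)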

So, instead, "printable" characters (meaning ones on your keyboard) are generally used as delimiters. This presents a problem since the characters on the keyboard are all reasonably likely to appear inside of data. Early on, it was very common to select characters like |, \, `, and ~ as delimiters because they are rarely used in text, and really only exist in ASCII and on standard keyboards by happenstance. The | was particularly popular because it resembles a dividing line and was already used on typewriters to make vertical rules in tables. In general, it was an obvious and relatively ideal choice for a field delimiter. Today pipes are still often used as field separators in certain log formats, especially in the POSIX world.

But, of course, far more common than the pipe in modern usage is the comma. The comma as delimiter is so common that the conventional term for it, comma separated values or CSV, has become basically synonymous with tabular data in plain-text form. The comma has its upsides in that it's a familiar character and already has a related semantic meaning in natural language, where it's used to punctuate lists, but it has the serious disadvantage that it commonly occurs in text, meaning that your comma-separated fields may have commas in them. These commas then need to be escaped.

Wikipedia, which as previously mentioned is never wrong, tells us that the use of a comma as a field delimiter was present in an IBM FORTRAN compiler in 1972. Further, some additional research suggests that FORTRAN (including FORTRAN 77, in which this feature was standardized) is also the source of the maddening "quoting" semantics that exist with CSV. That is, when I talked about escaping delimiters using escape sequences, I was describing a "modern" approach to the problem. CSV typically takes a different approach called "quoting," in which fields that contain the delimiter must be surrounded by quotes. The quotes do not demarcate the fields, though; they only allow a field to contain the delimiter. This leads to some truly insane situations, where for example the string ,"", in a CSV field must be "quoted" as ","""",". \" isn't exactly ergonomic but """ manages to be worse.
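You don't have to take my word on that example; Python's csv module implements the common Excel-ish dialect and will happily demonstrate it:

    import csv, io

    # A single field whose value is the four characters: , " " ,
    buf = io.StringIO()
    csv.writer(buf).writerow([',"",'])
    print(buf.getvalue().strip())   # ","""","  -- field quoted, every interior " doubled

    # And the round trip back, to confirm the quoting decodes:
    print(next(csv.reader(io.StringIO(buf.getvalue()))))   # [',"",']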

Comma delimiters were a massive mistake, and attempts to formally standardize the format (e.g. various RFCs) generally only serve to illustrate how poorly defined "the CSV format" is, being basically a loose description of the nonstandard behavior of an early-'70s FORTRAN compiler. That said, the format became popular in business applications (because IBM used it) and was a natural "lowest common denominator" format for spreadsheet tools, so we are now pretty well stuck with it. Unfortunately, the term "CSV" is used so carelessly to describe so many things that it often requires careful handling. For example, when you open a CSV file in Excel it prompts you to choose all kinds of parameters for how the file will be parsed. This is of course extremely user friendly.

Another once-common choice of "user-friendly" delimiter that has fallen out of popularity is the tab. Files which use the tab as a delimiter are sometimes called tab-separated values or TSV. TSV has the advantage that the tab character is unlikely to appear in a field, but it loops directly back to the disadvantage of the dedicated ASCII field separator characters and somehow makes it worse. Tabs are not just non-printable characters, they are characters that induce context-specific behavior in the printer (typically snapping to the next 8-character tab stop). This means that TSV files as printed or viewed in editors (unless the printer/editor handling of tab is modified) look extremely wacky and are usually even harder to look at than CSV files.
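A quick sketch of the problem, if you want to see it for yourself:

    rows = [["id", "name", "qty"], ["1", "a", "10"], ["2", "superlongname", "3"]]
    for row in rows:
        print("\t".join(row))
    # The printed columns wander, because each tab snaps to the next tab
    # stop (conventionally every 8 characters) rather than to a fixed field:
    # id      name    qty
    # 1       a       10
    # 2       superlongname   3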

The point here is that encoding structured data in text brings up a very fundamental problem: structured data tends to include text, so there is no clear delineation between the symbols that encode the structure of the data and the symbols that are the actual data. This inner conflict means that virtually all text-based encoding standards require some kind of escaping or quoting convention. This sometimes gets complex. For example, in HTML there is both a symbol other than > that encodes > (the entity &gt;) and a way to demarcate a section of text within which an actual > is not to be interpreted as markup. Naturally this way of marking a section to not be interpreted as structure must itself not be interpreted as structure, but it also must have an end delimiter which will not occur in the non-structure data (so that it can end a section of data that cannot contain structure), and so it has a syntax that is completely mind-numbing: <![CDATA[ ... ]]>.
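Both mechanisms are easy to see from Python (html.escape does the entity encoding; the CDATA wrapper here is just string pasting, since it's pure syntax):

    import html

    data = "if a > b then a->next"

    # Mechanism 1: encode the structural character as something else (&gt;).
    print(html.escape(data))    # if a &gt; b then a-&gt;next

    # Mechanism 2: demarcate a section in which > is plain data. The end
    # delimiter ]]> is now the one sequence the data must not contain.
    print("<![CDATA[" + data + "]]>")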

A bit about control characters

Having said all that about delimiters, let's talk a little bit about those first 32 ASCII characters. They are not all completely unused. For example, null or the 00 byte is in the ASCII character set as, well, the NUL character. In null-terminated strings we tend not to think of the null as part of the string (and thus we don't think of it as "text"), but the ASCII coding allows us to view it as a part of the string if we want to.
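A short sketch of the two views of the same bytes:

    buf = b"hello\x00junk left over after the terminator"

    # The C-style view: the string ends at the first NUL...
    print(buf.split(b"\x00", 1)[0])          # b'hello'

    # ...but ASCII is happy to consider the NUL part of the text:
    print(repr(buf[:6].decode("ascii")))     # 'hello\x00'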

Carriage return and line feed are also widely used to represent a new line (of course LF on Linux and CRLF on Windows, for historic reasons and just to inspire hate in us all). Backspace and EOF (end of file) are also used for their intended purposes in certain cases, but not really all that often---backspace over certain types of terminal connections, and the EOF character (SUB, or Ctrl+Z) mostly only on Windows, as Unix chose a different architectural approach to handling the end of files.
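If you want to see the hate-inspiring part in miniature: Python papers over the difference when splitting lines, but the bytes on disk still differ.

    unix_text = "line one\nline two\n"
    windows_text = "line one\r\nline two\r\n"

    # splitlines() understands both conventions:
    print(unix_text.splitlines())       # ['line one', 'line two']
    print(windows_text.splitlines())    # ['line one', 'line two']

    # The on-disk bytes are another story:
    print(windows_text.encode("ascii")) # b'line one\r\nline two\r\n'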

But, more interesting, let's talk about that escape character. It is a long-running convention in both printing and video terminals to recognize special sequences beginning with the escape character as "control sequences" which modify the behavior of the terminal. I am not sure where this originates, but DEC's first video terminal, the VT05 in 1970, behaved this way. IBM terminals of the same time period did not include "escape sequences" of the same fashion, but only because IBM took a radically different approach to video terminal interfacing which was not hampered by being a re-purposed telegraph, and so provided a much more flexible way for the computer to communicate with the terminal. In general, IBM never really bought into the basic "text in/text out" approach to terminals, which was adopted by the mid- and minicomputer vendors primarily as a cost-saving measure; this is one of the fundamental philosophical divides between "big iron" and mid/minicomputing (e.g. "modern computers").

When these escape sequences were standardized by ANSI, they avoided collision with existing proprietary escape sequences by having all of their escape sequences start with the sequence ESC[. Not unrelatedly, the ESC character is conventionally represented in "control character" notation as ^[, leading to ANSI sequences sometimes being represented in printable characters as starting with ^[[. You may have seen these representations when you use the arrow keys on a terminal connected to a computer or software which, for whatever reason, does not understand the escape sequences and so echoes them back as entered text. There are no ASCII characters for the arrow keys, remember, and so your terminal has to encode them in terms of ASCII characters using escape sequences. This all goes to highlight how much of a problem non-text-in-text and text-in-non-text and non-text-in-text-in-non-text gets to be.
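These sequences are easy to poke at from any language; here's a Python sketch using two bog-standard ANSI sequences (color on/off and the up arrow):

    ESC = "\x1b"   # the ASCII escape character, ^[ in caret notation

    # ANSI control sequences begin with ESC [ (the Control Sequence Introducer):
    print(ESC + "[31m" + "this prints red" + ESC + "[0m")

    # There is no ASCII character for the up arrow, so the terminal sends
    # the three-character sequence ESC [ A when you press the key:
    up_arrow = ESC + "[A"
    print(repr(up_arrow))   # '\x1b[A' -- which a confused program echoes as ^[[A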

To loop back around to relevancy, this is exactly the problem that markup languages face: they are used to annotate text, using the exact same symbols that constitute the text. To paraphrase von Neumann, anyone doing so is, of course, in a state of sin.

[1] If 80 columns seems familiar, yes, through a few steps of indirection not really dependent on FORTRAN, these cards are the reason that 80 characters is considered a standard width for terminals. More specifically, "interactive terminals" such as TTYs and video terminals are more or less based on "keypunches," the machines which punched the holes in these cards. Some of the later keypunches had 80-character-wide displays on which they showed the entered data (the displays were simply easier and more ergonomic to read than the typed row along the top of the card) and led more or less directly to the invention of the video terminal. As a fascinating bit of design history, one early IBM "video terminal" used an 80x4 character CRT display, on top of which sat two angled mirrors, so that each of two separate operators saw an 80x2 field which allowed them to see the "card" they were entering and a status line. CRTs were very expensive at the time; this shared-tube design simply saved money. The allowance of a second line per operator for "status" is perhaps the inspiration for most later video terminals having some provision for a "status line" at the bottom of the screen. Both 80x24 and 80x25 are considered "conventional" terminal sizes because several popular terminals were 80x25 and allowed the bottom status line to be toggled on or off.

[2] If this seems a little crazy, keep in mind that early terminals and printers were electromechanical. Supporting only uppercase characters reduced the size of the type mechanism and the number of characters (and thus the number of bits required to code for the characters), which could be a significant reduction in both the price and size of these devices. Further, early computer terminals were often modified teletypewriters (TTYs) which had used the Baudot encoding, which includes only uppercase characters for the same reason, as well as to increase effective baud rate, since each symbol needed only five bits.