>>> 2021-01-04 happy new year
Once upon a time, an impending new year spurred quite the crisis.
When I was a high school student, I volunteered (actually "interned", but the unpaid kind) at the thrift shop of a nonprofit electronics recycling organization. This is the kind of experience that leaves you with quite a few stories, like how before using the public WiFi you had to agree to a terms of service that was actually just a weird ramble about how WiFi might give you cancer or something. But, there is one story that is apropos to the new year: I maintained a small collection of power strips labeled "Y2K Compliant" that I had priced $5 higher than the other, presumably non-Y2K compliant ones. I like to think that the "Y2K Compliant! +$5" sign is still up at the thrift store but like many of my efforts I assume it is long gone. Last time I was there they had finally made the transition to thermal receipt printers I had labored to start, eliminating the 30 second time to first page/awkward pause on the laser printers that had been used for receipts.
I'm not sure who it was now, but judging by the number of them we had on hand it was a fairly large manufacturer of power strips that started putting "Y2K Compliant" on their labels. This is, of course, complete nonsense, and ultimately Y2K ended up having very little impact compared to the anticipation. Nonetheless, people who dismiss Y2K as having been a hoax or somehow fake are quite incorrect. Successfully navigating the year 2000 transition required a concerted effort by software engineers and IT analysts throughout various industries.
The tedious work of validating software for Y2K compliance was one of several things lampooned by the film Office Space. I'm not sure that that many people in software today even understand how Y2K was a problem, though, so I will give a brief explanation.
The common explanation of the Y2K problem is that programmers decided to save space by only storing two digits of the year. This is basically true, but many people in software today might find that sentence a bit nonsensical. An int is an int, which is today 32 bits, right?
Well, sure, but not historically.
Many early business computers, most notably but not at all limited to various IBM architectures such as the influential System/360 family, made extensive use of either Binary Coded Decimal (BCD) or a slight optimization on the same format (packed BCD). That is to say, these computers did not use the exponent-and-significand representation that we think of today for floating point numbers (e.g. IEEE floating point). Instead, they stored numbers as a sequence of base 10 digits. That is, essentially, as a string.
It's sort of funny how, to a lot of CS students and programmers today, the idea of using BCD to represent numbers is absurd. Representing the quantity 10.35 as a sequence of bytes encoding 1, 0, 3, and 5, along with a value that is either the position of the decimal point or a base-10 exponent depending on how you look at it, feels similar to string typing, which is to say that it is practically blasphemous, even though today's most popular software stack uses something which is largely more confusing than string typing.
I would argue, though, that it is IEEE floating point notation which is the eccentric, unfortunate choice. Consider this: floating point operations often pose a subtle challenge in software engineering largely because the precision properties of modern floating point representations are decidedly unintuitive to humans. The resolution with which IEEE floats represent numbers varies with the magnitude of the number and is difficult for humans to determine.
This leads to concepts like "machine epsilon" which attempt to quantify floating point precision but are difficult to actually apply to real-world situations. Similarly, floating point numbers can be made more precise by allowing more bits for the representation, say, 64 instead of 32. This is still fairly confusing, though, and very few people have any intuitive or even rote sense of how much "more precise" a 64-bit float is than a 32-bit float.
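As a quick demonstration of how the resolution shifts with magnitude, Python's standard library can show the spacing between adjacent 64-bit doubles directly (a small sketch, nothing more):

```python
import math
import sys

# Machine epsilon: the gap between 1.0 and the next representable double.
print(sys.float_info.epsilon)   # 2.220446049250313e-16
print(math.ulp(1.0))            # same value: the spacing near 1.0
                                # (math.ulp requires Python 3.9+)

# The spacing grows with magnitude: near 1e16, adjacent doubles are
# a full 2.0 apart, so adding 1 is silently lost.
print(math.ulp(1e16))           # 2.0
print(1e16 + 1 == 1e16)         # True
```

Whether a given addition will be representable thus depends on where on the number line you happen to be, which is exactly the kind of thing humans are bad at keeping track of.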
The reality is that exponent-and-significand floating point representations are just plain confusing.
BCD, on the other hand, is not.
BCD represents floating point numbers the exact same way that humans do: as a set of digits. This means that the precision properties of BCD are very easy to understand: adding additional words (bytes, etc.) to the end of a BCD number increases the number of significant decimal digits. This is really very easy to follow, and often makes it very easy to make choices about how long the representation needs to be.
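A minimal sketch of packed BCD in Python, assuming the common two-digits-per-byte layout (the helper names pack_bcd and unpack_bcd are mine, not any standard API):

```python
# Packed BCD: two decimal digits per byte, one per nibble.
# Illustrative only; real machines also encode sign and scale.

def pack_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits, two per byte (zero-padded to even length)."""
    if len(digits) % 2:
        digits = "0" + digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )

def unpack_bcd(packed: bytes) -> str:
    """Recover the decimal digit string from packed BCD bytes."""
    return "".join(f"{b >> 4}{b & 0x0F}" for b in packed)

# Precision scales the obvious way: each extra byte is two more decimal digits.
assert pack_bcd("1035") == b"\x10\x35"
assert unpack_bcd(b"\x10\x35") == "1035"
```

Note how the stored bytes, read in hex, literally spell out the decimal number, which is precisely why the precision of the format is so easy to reason about.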
While the underlying reasons are somewhat complex, it is an accurate summary to say that the reason we use exponent-and-significand floating point representations rather than BCD today is... "technical reasons." IEEE representation is amenable to highly optimized implementations of a variety of operations, has the property of a fixed size regardless of magnitude which is extremely convenient for implementation, and ultimately is very amenable to implementation on RISC systems. This is all to say that IEEE representation is better for every purpose except interaction with humans.
Good thing the evolution of computing has rarely, if ever, actually had any regard for user experience.
So this was all basically a preface to explain that the Y2K bug, to a large extent, is a direct result of BCD representation.
In particular, the Y2K bug tends to emerge from the use of COBOL. COBOL is a very interesting language that deserves a lengthy discussion, but one of the interesting properties of COBOL is that it has a data serialization format as a key feature. In a fashion somewhat similar to modern libraries like protobuf, COBOL programs include as part of their source a description of the data structures that will be stored. These data structures are described not so much in terms of types, but instead in terms of the actual byte-wise serialization of those types.
Although COBOL now supports IEEE floating point, BCD representations are much more typical of the COBOL ecosystem. So, COBOL programs typically start by describing the numbers they will store in terms of the number of digits.
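In COBOL this takes the form of a picture clause, something like a two-digit PIC 99 year field. As a rough Python analogue of how such a field misbehaves at the century boundary (the function names here are my own, purely illustrative):

```python
# Simulating a two-digit year field, roughly how a COBOL PIC 99-style
# value behaves: the field can only hold two digits, so arithmetic
# wraps at 100, and quantities derived from the year go wrong.

def next_year(yy: int) -> int:
    return (yy + 1) % 100   # the field only holds two digits

def age(birth_yy: int, current_yy: int) -> int:
    return current_yy - birth_yy

print(next_year(99))   # 0, i.e. 1999 rolls over to "1900"
print(age(65, 99))     # 34: born '65, it's '99
print(age(65, 0))      # -65: born '65, but the year reads as "00"
```

The stored data is not wrong, exactly; it is the interpretation of "00" that silently changes meaning after the rollover.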
So, to summarize, to a large extent the source of "the Y2K bug" is that a large number of computer systems were implemented in COBOL and specified a serialization format in which the year was stored as a two-digit BCD value. This made sense because storage and memory were both very expensive, and in the '80s there hadn't been software for long enough for there to be legacy software, so few engineers probably realized that the heap they had written would still be in use in the next century.
"Fixing" the Y2K issue, as parodied in Office Space, basically entailed modifying all of this COBOL software to specify four digits instead. Of course, not only was this a considerable amount of effort for large codebases, you also either needed to convert all stored files to the new format or modify the software to detect and handle both the old and new serializations. What a headache.
I'm not sure if there's some moral to draw from this story, it just came to mind since we hit the new year. The good news is that no one makes mistakes like this today. Instead, Sony GPS receivers stop working because a ten-bit week counter rolled over, block storage drivers stop working because there was a leap day in the year, and ultimately a ton of software uses a signed 32-bit counter of seconds since 1970 that's going to overflow in 2038, so clearly we've all learned our lesson about not accommodating numbers having slightly higher values than we had once expected.
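For the record, the 2038 rollover moment is easy to compute from the largest value a signed 32-bit time_t can hold:

```python
from datetime import datetime, timezone

# The largest value of a signed 32-bit time_t, and the moment it names.
t_max = 2**31 - 1
print(t_max)  # 2147483647
print(datetime.fromtimestamp(t_max, tz=timezone.utc))
# 2038-01-19 03:14:07+00:00 -- one second later, the counter wraps negative
```

After the wrap, a naive implementation finds itself shortly before the epoch, in December 1901, which is Y2K all over again but with worse failure modes.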
Most of why I write about this is because I, personally, miss BCD. Poul-Henning Kamp once wrote for ACM that he believed the choice of null-terminated strings (over length-prefixed strings) for C to be the most expensive one-byte mistake ever made in computer science. Along this same vein of thinking, one wonders if the success of IEEE floating point representation over BCD has been a mistake which has led to huge costs due to the numerous and subtle errors caused by the sharp edges on the representation.
At the cost of less performant and more complex implementation, BCD would nearly eliminate a huge class of errant software behavior. Never again would we get an unexpected .00000000000000001 or have algebraically equal numbers compare as non-equal. On the other hand, we would gain a new class of errors related to more frequent overflow of numbers, since an additional digit is required for each power of ten.
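Python's standard decimal module gives a taste of what decimal arithmetic in the BCD spirit buys over binary floats:

```python
from decimal import Decimal

# Binary floats: algebraically equal expressions compare unequal.
print(0.1 + 0.2)                  # 0.30000000000000004
print(0.1 + 0.2 == 0.3)           # False

# Decimal arithmetic (Python's decimal module): digits behave like digits.
print(Decimal("0.1") + Decimal("0.2"))                    # 0.3
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))  # True
```

The trade-off shows up exactly where described above: the decimal version is slower and its values grow with the number of digits rather than staying a fixed size.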
Would we all be better off now if BCD had won? Perhaps. I mean, the next world-ending crisis it would cause wouldn't be until the year 10000.
While there are plenty of exceptions, it's a good generalization to note that BCD number representation tends to be associated with systems that make extensive use of abstraction between instructions and the actual machinery. In the modern era we would tend to call this "microcoding" but I am actually referring to things like the SLIC in IBM architectures, which is somewhat analogous to the microcode in the x86 architecture but tends to be significantly "thicker." Consider that the SLIC in modern IBM systems is essentially a virtual machine implemented in C++, not so dissimilar from, say, the JVM. Since arithmetic operations on BCD are fairly complex and have very variable runtime, it is much easier to implement them as machine instructions in highly abstracted systems (where the "instruction set" is really more of a high-level interface implemented by underlying software) than in RISC systems like x86 (where the "instruction set" is really more of a high-level interface implemented by underlying software but we all feel bad about this and try not to discuss it too much).
The use of BCD is of course not at all limited to COBOL, and plenty of Y2K non-compliant software was written in assembly or other languages. I use COBOL as the nearly exclusive example, though, because it is fairly easy to find examples of COBOL software today demonstrating the allocation of two BCD digits to the year field, while it's fairly difficult to find such examples today in other languages. I also like to bring up COBOL because the idea of a serialization format as the core data structure of a language is something which fell out of fashion, and so is rarely seen today outside of legacy COBOL. Compare MUMPS. Sorry, err, the "M Programming Language."