_____                   _                  _____            _____       _ 
  |     |___ _____ ___ _ _| |_ ___ ___ ___   |  _  |___ ___   | __  |___ _| |
  |   --| . |     | . | | |  _| -_|  _|_ -|  |     |  _| -_|  | __ -| .'| . |
  |_____|___|_|_|_|  _|___|_| |___|_| |___|  |__|__|_| |___|  |_____|__,|___|
  a newsletter by |_| j. b. crawford
>>> 2020-08-08 instant messaging

Computing and communications technology evolved very rapidly during the 20th century, particularly during and after WWII. Multiple universities, research groups, and corporations pursued development of computing technology simultaneously and in parallel, but often with nuanced variations in their designs. The result is that "firsts" are often hard to establish. For example, you might ask, "what was the first digital computer?" This title is frequently awarded to the ENIAC in 1945, but with slight changes in definition the title could equally be granted to a number of other devices, including the Zuse Z1 as early as 1936. The problem would be even more complex had we an adequate understanding of the history of computing in the Soviet Union, but unfortunately, very few sources on Soviet computer development have been translated to English and the iron curtain remains a notable blind spot in the western scholarship of technology history. Had we the ability to read Russian and access to the USSR's secret files, we would almost certainly find ourselves with an even more complicated timeline. This phenomenon is not at all unique to computer history (consider the contested invention of radio) but the accelerated telecommunications and computing research of WWII and the Cold War, and the attached cloak of secrecy, make computer history a particularly difficult case.

I bring this up to introduce some of the inherent complexity in discussing the history of instant messaging. Instant messaging is a relatively obvious idea which, at least in retrospect, descends naturally from letters, postal systems, telegraphy, and so on. Instant messaging was not invented by any one person, but instead arrived at by many people. Over time, these distinct lineages have coalesced (through the powerful mechanism of user expectations) into the fairly consistent feature set, interface, and technology stack that we call instant messaging today. There is a direct conceptual inheritance from the Roman post to Slack, but it is difficult to actually illustrate.

In the 2017 Wired article "The Secret History of FEMA," Garrett M. Graff proposes the FEMA-precursor computer network EMISARI as providing the first cross-computer instant messaging system, in 1971. Various other sources agree. Of course, there is a matter of definitions here. Machine-local instant messaging systems existed at least by the mid-1960s (e.g. in MULTICS) and have probably existed in some form or other for as long as multi-user operating systems have, if not longer. While it may seem significant from a modern perspective that EMISARI was cross-computer (over a network!) while these facilities were not, the differentiation likely seemed much less significant at the time---this was a period when the predominant form of networking was to allow many user terminals to connect to one computer running elsewhere. The closest analogue to the internet of today was a single computer with many users, and this concept of computer networking existed (at least in spirit) well into the '90s in the form of AOL, CompuServe, and other "traditional" internet services, which presented a user experience directly based on the dial-in BBSs before them.

For another perspective on an origin for instant messaging, the military's AUTODIN system (more or less the direct predecessor to ARPANET and the internet) was passing messages at high speed by the 1960s. AUTODIN was based on telegraphy and intended to fill a use case more like telegrams, but AUTODIN was fast enough that, particularly in later years, it was used to pass messages composed directly on video terminals in a fairly interactive fashion. It's difficult to judge whether or not this qualifies as instant messaging, but it's certainly a step in the evolution, and AUTODIN has its precedents as well.

Perhaps the first to use instant messaging were telegraph operators who exchanged maintenance information---and banter---directly among themselves in between passing paid messages.

Given that it is difficult to nail down the "first" instant messaging system, another interesting question might be "what was the first successful instant messaging system?"

I do not intend to disparage EMISARI by not suggesting it for this role, but there was never a full-on nuclear onslaught to really test its capabilities. Once again, it is difficult to provide any answers. One particularly prominent early use of instant messaging, though, relied on the network operating system VINES.

I should preface this by explaining a bit about the concept of a network operating system (NOS). The term NOS is somewhat overloaded, and today is often used to refer to operating systems for network appliances, like Cisco IOS and Vyatta. Historically, though, NOS more commonly referred to an operating system which was intended to be used as part of a network. Yes, today, the concept of an operating system *not* designed for use with a network stretches reason, but prior to the mid '90s the idea of networking microcomputers was not a complete given, and popular microcomputer operating systems often did not feature network capabilities. Windows, for example, did not feature native TCP/IP support until Windows 95. Prior to that point, Microsoft had focused on the NetBIOS protocol.

Several vendors offered network operating systems, with a couple like Novell NetWare managing to survive (at least as names) almost to this day. These operating systems were completely built around network-based features like network file systems, electronic mail, and printer sharing, and often used proprietary protocols. One protocol, though, emerged as a de facto early standard: Xerox Network Systems or XNS, which was an early solution to run on top of Ethernet[1]. XNS is not actually older than IP (or at least probably isn't, depending on definitions), but it was flexible and fairly easy to implement, and Xerox released the specification to the public domain. As a result several NOS, including NetWare, were XNS-based[2].

One such XNS-based network operating system was Banyan VINES, for Virtual Integrated Network Service---truly a product name from the classic era. VINES would be a side note in the history of these NOS were it not for the late-'80s decision of the Marine Corps to standardize on VINES as its networked communications solution. The Marine Corps has a general reputation for selecting off-the-shelf products which can be quickly fielded rather than having custom products developed for their use (as is the norm in other military branches), and VINES is one of the success stories of this strategy. Just a couple of years after the purchase of VINES it was widely deployed during the Gulf War, carried over satellite modems to provide electronic mail and file sharing between forward operating bases. The system was viewed as a tremendous success, and the popularity of the VINES instant messaging feature with the Marines is often mentioned as the origin of the modern military and intelligence community's love of chat rooms for tactical exchange. If there is but one case in which instant messaging provides true business value, battlefield tactical communication is a strong contender.

In the era of NOS the internet was in its infancy and not commonly seen outside of the DoD, universities, and major corporations. Networked computers in general and instant messaging specifically were not accessible or known to consumers until the spread of consumer internet services in the latter half of the 1990s.

While most consumer internet services had an equivalent, it is essentially impossible to discuss the history of instant messaging without emphasizing AOL Instant Messenger, or AIM. AIM was initially an integrated feature of the larger AOL product (which was a "web browser" of sorts integrated with various other services), but by 1997 it was offered as an independent application available to AOL and non-AOL customers alike. As a free service, AIM was completely ubiquitous among the youth of 2000-2005, and in general served as one of the epicenters of internet culture.

AIM was also influential in laying out the downfall of instant messaging. AIM was based on a protocol called OSCAR, which was proprietary but reverse-engineered by various other developers. This led to a series of third-party AIM clients, including the influential Trillian, which simultaneously supported a number of different messaging services of the era, such as MSN. AIM, an implementation of a proprietary protocol, set the pattern for numerous other messaging services to follow, including AIM's leading contemporaries Yahoo Instant Messenger and MSN Messenger.

This is not to say that the situation was entirely one of proprietary protocols. The open-standard internet relay chat (IRC) protocol dates to the late '80s and is still in some use today, but its use is quite limited compared to commercial services, and it has only lost ground. There are numerous reasons for this, likely chief among them the active marketing of commercial IM services, but perhaps the biggest reason is that IRC has largely not made the jump to the modern age. Since the '90s it has been generally less user-friendly and less feature-rich than the commercial alternatives. This is to some extent unavoidable as the IRC transport protocol is very limited and extension has always been somewhat awkward.

In the late '90s, a small company called Jabber set out to standardize an open protocol for instant messaging which was extensible, so that it could incorporate new features. While it was called the Jabber protocol at the time, it was later renamed to Extensible Messaging and Presence Protocol or XMPP. The Extensible hints both at its ability to support a wide variety of use-cases and its heavy use of XML, which, as we have established, is the pinnacle of computer science's many achievements. XMPP has seen some adoption for its entire life, but received a major boost as Google and Facebook both introduced IM services based on XMPP (Talk and Messenger, respectively). Because these two commercial IM services relied on XMPP, they had the combined advantages of being well-marketed, user-friendly, and supported by a variety of third-party clients.
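
To make "extensible" a little more concrete, here is a rough sketch of the idea in Python using only the standard library. It is a simplification under stated assumptions, not a real XMPP client: actual stanzas live inside a long-running XML stream and carry the jabber:client namespace, both of which are left out here, and the "mood" element and its urn:example:mood namespace are invented purely for illustration. The function names are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical extension namespace, invented for this example only.
MOOD_NS = "urn:example:mood"


def build_message(sender, recipient, text, mood=None):
    """Build a simplified XMPP-style <message/> stanza as a string."""
    message = ET.Element(
        "message", {"from": sender, "to": recipient, "type": "chat"}
    )
    ET.SubElement(message, "body").text = text
    if mood is not None:
        # An extension is just a child element in its own XML namespace.
        ET.SubElement(message, f"{{{MOOD_NS}}}mood").text = mood
    return ET.tostring(message, encoding="unicode")


def read_message(stanza):
    """A client that knows the extension: returns (body, mood or None)."""
    message = ET.fromstring(stanza)
    return message.findtext("body"), message.findtext(f"{{{MOOD_NS}}}mood")


def read_body_only(stanza):
    """A client that predates the extension: it never looks for <mood/>."""
    return ET.fromstring(stanza).findtext("body")
```

The point of the design is that the older client keeps working: it reads the body and never notices the extra namespaced element, which is roughly how new features can be layered onto the protocol without breaking existing software.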

Unfortunately, neither Google nor Facebook fully leaned into XMPP's federated design (that is, you could only use Google Talk to talk to other Google Talk users), but still, this period of around 2008 was very likely the apex of instant messaging. Finally, instant messaging functioned according to a published standard with uniformly good support. Naturally, this situation did not last.

Perhaps this is enough for now. In a future message, coming perhaps not that long from now as I am about to take vacation, I would like to talk about the IM landscape after XMPP's demise, and express some opinions about what is wrong and why we are stuck with it. I had previously published an essay on my personal website entitled "Obituary, to Ray Tomlinson and Email" (you can read it here: https://jbcrawford.us/writing/obituary), and I will be not quite restating it but repeating many of the ideas in that essay, which represents some of my most depressing thoughts on the state of technology today. In likely a third message, I would like to extend this to some discussion of the current landscape of federated social media, and the defects and enemies it has in common with open-standard instant messaging.

[1] It is an extremely important but oft forgotten fact of history that Ethernet significantly predates TCP/IP. Many misapprehensions about network protocols and their history come from the impression that Ethernet was somehow designed to carry IP, or that IP was designed to be carried over Ethernet. Neither of these is true; the close pairing of Ethernet and IP is awkward happenstance, and as a result the two each have features that are redundant or confusing when used with the other.

[2] The history of the network protocols which proliferated in the '80s through early '90s is long and interesting, the kind of thing that could easily fill a book. It is remarkable the extent to which, today, we take universal use of IP for granted. Not that long ago, MacOS spoke AppleTalk and Windows spoke NetBIOS, and the two would not understand each other. And those are just the two lineages that have lasted to this day---there were at least a dozen network-layer protocols being used in business networks in that period, many of which were carried over RS-232 or RS-422 for low cost. These have a tendency to pop up from time to time in the modern era, often with industrial automation and other "legacy" equipment.
--------------------------------------------------------------------------------
>>> 2020-08-01 crying for onions


*** In the interest of being up-front there is some mention of child sexual abuse in this one. It's brief as I intend to take up the topic in more depth in a future message, but you still might want to skip from paragraph starting "I'm kidding, Tor..." to "pearl-clutching aside" if you would rather not think about it. This is a topic that I think is important to discuss (for reasons I outline here), but it is not *easy* to discuss, and I hope that it is clear that I may make light of it only because that is my way of discussing everything. The issue is not at all a light one, and that is why technologists should choose to engage with it.

I have been a bit busy lately due to some combination of finally deciding to commit to getting my private pilot's certificate and spending a greater than average amount of time getting angry at computers and resolving not to touch them ever again. However, I have finally returned to prattle on a bit longer about online privacy.

What I want to talk about today is: Tor.

I have always had a rather quarrelsome relationship with the Tor project. There are a few reasons for this, some technical and some not. Just for the sake of getting past the boring parts I'll dispose of the non-technical ones first: for one, prominent Tor developer Jacob Appelbaum (who often represented the project in public) was widely accused of sexual harassment, which a private investigator hired by the Tor project reported to be true. Because Appelbaum was the public face of the project to such an extent, this represented a bit of a black mark on the organization, which may have sat on the issue for a year or longer. Second, the Tor project has attracted funding from a wide variety of sources, most of which I personally feel was ill-spent supporting a project that has a good brand but poor credentials. But of course these are all issues which are quite separate from the actual technology, and *I* am here to complain about *computers*.

Let's talk about what Tor *is*. Tor is the most prominent implementation of a concept called "onion routing." The underlying idea is actually fairly simple and originates from some academic papers that led to the Tor project. Essentially, the idea is that if you route traffic through several "layers" of a network, each layer being unaware of the other layers (this blindness is achieved by encryption), no layer has the information from other layers to establish the actual origin of the traffic. An explanation that I personally think is simpler than the "onion" metaphor goes something like this: if you tell a friend to pass a message to a friend to pass a message to another friend, after a few rounds of this no one will be clear on where the message actually came from. This is basically what Tor does, but of course routing IP traffic requires having a return path, so Tor uses a cryptographic approach so that each "friend" is able to route traffic in the reverse direction as well but none of them know the route more than one hop in each direction.

So Tor routes your IP traffic through a series of nodes, each of which is blind as to the full traffic route. There are a couple of types of nodes. A Tor "node" or "router" in general is one that shuffles traffic around inside of the network. An "exit node" is specifically a node that is willing to be the final node in the chain, forwarding traffic into the public internet. Exit nodes are somewhat less common because the Tor network is widely used for various types of abusive behavior and the exit nodes, being the apparent origin of this traffic, tend to catch most of the flak for it.

If each node only knows the previous and next hops in the route, as is the scheme with Tor, then three hops through the network is sufficient such that no one node knows the source *and* destination of the traffic. This creates a form of anonymity: the nodes that know who you are don't know who you're talking to, and the nodes that you're talking to don't know who you are. In the scheme that we devised in the last post, this provides anonymity of your identity from the operators of the websites you access. This is the primary objective of Tor: to allow you to access web services without the operators of those services knowing who you are.
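
The layering is easier to see in a toy. The sketch below is my own illustration in Python, not Tor's actual construction: the "cipher" is a deliberately insecure XOR keystream built from SHA-256, the sender is simply assumed to already share a key with every node (real Tor negotiates keys hop by hop as a circuit is built), there is no return path, and the names wrap, peel, and keystream_xor are made up for this example. What it does show is that each node can remove exactly one layer, and that doing so reveals only the next hop.

```python
import hashlib
import json


def keystream_xor(key, data):
    """Toy symmetric cipher: XOR with a SHA-256 counter keystream. NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    # zip() stops at the shorter input, so extra keystream bytes are ignored.
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(route, keys, destination, payload):
    """Build the onion from the inside out. route lists node names in order."""
    # Innermost layer: only the last (exit) node can read the destination.
    inner = json.dumps({"next": destination, "data": payload}).encode()
    layer = keystream_xor(keys[route[-1]], inner)
    # Each earlier node gets a layer naming only the node that follows it.
    for node, next_node in zip(reversed(route[:-1]), reversed(route[1:])):
        body = json.dumps({"next": next_node, "data": layer.hex()}).encode()
        layer = keystream_xor(keys[node], body)
    return layer


def peel(node, keys, blob):
    """Remove one layer with this node's key; returns (next_hop, data)."""
    message = json.loads(keystream_xor(keys[node], blob))
    return message["next"], message["data"]
```

With a three-node route, the first node learns only the name of the second, the second learns only the name of the third, and only the third sees the destination and the request. For every node but the last, the data returned by peel is a hex string that has to be converted back to bytes before it is handed to the next hop.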

To be clear, the concept of onion routing is not specific to Tor, although Tor was developed in part by the author of the first paper to describe the scheme, so it is perhaps the "reference" implementation. Early on, onion routing was also notably implemented by the Mixmaster anonymous email system (onion routing tends to be high latency and so is naturally more suited for asynchronous email than real-time IP routing), which is primarily used to send bomb threats to state universities and which I contributed to for some time in my wild youth[1]. While Mixmaster is still somehow operational, no one cares about it, and onion routing is mostly associated with various IP routers of which Tor is by far the most widely used.

Tor was further extended with something called the Rendezvous System. The implementation is somewhat complex, but the basic idea is that it makes the privacy protections of Tor bidirectional. Instead of just protecting the identity of the user from the web service, it allows a user to connect to a web service without either knowing the true network identity (IP address) of the other. Very roughly speaking this means using Tor in a "hairpin" manner, sending traffic through the Tor network which loops right back into the Tor network to get to the other end. The rendezvous functionality is generally referred to as "Tor hidden services" and even more widely as "THE DARKNET," and it is the facility that allows you to go to a website whose URL is a very long sequence of random characters followed by ".onion" in order to purchase drugs.

I'm kidding, Tor as a mechanism of purchasing drugs is largely a failure at scale, because the drugs still need to be physically delivered (which provides all kinds of opportunities for law enforcement to detect and identify participants) and because Ross Ulbricht was not especially competent at running a criminal empire [2]. Tor is actually used for child pornography[3].

It should be clear by now that I am being extremely critical of the Tor project and painting it in a rather poorer light than almost everyone else, although I am certainly not the only person arguing that Tor serves primarily as an aid to what is, in the industry, called child sexual abuse material (CSAM). I am not sure that I want to spend this otherwise fairly good evening articulating the significant concerns that exist surrounding CSAM and internet anonymity services, however, CSAM is very much the elephant in the room in this area. Even after making somewhat light of the long run of bomb threats against the University of Pittsburgh (which at least did not result in bodily injury), I feel that I would be participating in common but ethically questionable activity to critically discuss Tor without addressing the issue of child pornography.

It is a well-known reality among people familiar with internet anonymity systems that the majority of anonymous internet content distribution systems, whether old or new, centralized or decentralized, have seen a significant amount of use to distribute CSAM. This is a complex issue and there is clearly a certain amount of moral judgment involved in establishing whether or not these services are a net negative or positive for society. However, I firmly believe that progress in internet anonymity technology requires that we acknowledge and grapple with this uncomfortable fact. The vast majority of internet anonymity projects have addressed the problem of CSAM by simply ignoring it. I do not feel that this is excusable. The issue is not merely one of CSAM; CSAM is simply the most obvious case and the case which is most heavily pushed towards advanced anonymity technology because of aggressive prosecution by law enforcement. The broader issue is that all anonymity technologies are highly subject to a wide variety of abuse. Consider, as another example, anonymous social media like Whisper and all of the problems it has been associated with.

To pretend that the matter of abusive use (whether towards children, other users, people's email inboxes, etc) is a social or "non-technical" problem and thus not a consideration in the design of systems is, in my opinion, a regressive view that keeps the development of these technologies in a sort of "silicon tower" and promotes the continued development of technologies that are increasingly complex but fail to address actual social problems. Ethical and human safety concerns require that designers, developers, and operators of anonymity and privacy technologies take a holistic view in which they consider the way that their technologies engage with real-world behavior and impact real people. Taking a techno-libertarian, crypto-anarchist view of the matter and proclaiming that "information wants to be free" and anonymity technologies are indifferent to their uses has both a human cost (to a real extent literally, in the form of excess deaths as a result of these technologies) and a technology cost in that this view actually stifles attempts to develop a technical approach that is aware of and responsive to these ethical problems. Technology which fails, and especially willfully refuses, to address social realities is technology which is not fit for purpose.

Clearly this is a complex issue and I have just expressed a great deal of opinion, some of which is less technical and more moral and political. I intend to write an essay explicitly on this topic in the future, but it's not a very easy essay to write as the considerations involved are many, hard evidence on the issues is slim, and my opinions on the topic seem to be directly opposed to those of many technologists, and so I feel that I must be more effective in persuasion. Normally I write about how computers are a mistake, an opinion which is remarkably non-controversial among professional users of computers, and so I don't have to try very hard at all.

Okay, so, pearl-clutching aside, let's get back to the technical considerations. What is Tor good for?

As I mentioned, the primary function of Tor is to protect the identity of the user from the web services with which they interact. With the addition of the rendezvous system (the "dark web"), this protection becomes bidirectional, protecting both the user and the service from their identity being known by the other.

This is very neat. I don't want to seem like I am downplaying the achievement; there is very real technical achievement in designing a system which can meet these goals. However, if I was at all successful in articulating the previous post, you will know that I feel that privacy technologies generally fail to communicate their actual capabilities and limitations to users. Even this feature of Tor is an important example.

Remaining anonymous online is very difficult. As I suggested in the previous message, there is a strong temptation among internet privacy services (whether for-profit or not) to present online privacy as simple. For VPN services, and even for Tor, it often comes down to IP address: if they can see your IP address they can identify you, if they cannot, they can't. In reality, the user's IP address is a minor consideration in most analytics and tracking schemes, since it is prone to changing unexpectedly anyway. As a result, mere use of Tor does virtually nothing to protect user privacy as users can trivially be re-identified in other ways.

The Tor project has significantly improved this situation, more than any other project, by developing the Tor Browser. The Tor Browser is a modified version of Firefox which is dedicated to the purpose of browsing via the Tor network. While it has various privacy-centric features built-in (such as disabling JavaScript by default and taking various anti-fingerprinting actions like fudging the viewport dimensions), by far the most significant feature of the Tor Browser is simply being a *different* web browser, which means that it will not, from the start, have the user's Facebook cookies.

This serves as a critical precaution against what I would say is the most difficult problem in online privacy: no matter what you do technically, people have a habit of identifying themselves. For every development to mitigate JavaScript fingerprinting, there are five hundred people who use the Tor network to log into their email. At this point they are, of course, completely identifiable to their email service, and as a result are very probably identifiable to a variety of other people based on various means of correlation. The Tor network and browser do feature various mitigations for this issue (like the ability to force selection of new routes) but they have limited efficacy and are probably ignored by most users.

The bottom line is that technical means of anonymity are very often foiled by user behavior. Changing user behavior to prevent users revealing their identities is extremely difficult and not something which is really amenable to technical solutions. Throwing additional plugins into Firefox to prevent identifying user behavior will always be a game of whac-a-mole. While law enforcement has in some cases re-identified users of Tor using technically sophisticated methods of attacking the privacy technology, they have more often re-identified users by using "old-fashioned policework" like noticing that they use the same stupid username on crime forums as they did on MySpace in 2005.

So, I have hopefully convincingly argued that Tor is not an unmitigated success in protecting user identity. Certainly it has benefits, but the privacy it provides is not absolute, and really taking advantage of Tor requires a relatively sophisticated understanding of the practical and technical issues surrounding online privacy, tracking, etc.

As our dearly departed Billy Mays would say, wait, there is more. While Tor provides certain privacy advantages, it also provides a privacy disadvantage: the Tor exit node handles all traffic in cleartext, so the exit node used to route your Tor connections has visibility into all of your traffic to the public internet. This is somewhat (but not completely) mitigated by the use of HTTPS, but it is still clear that significant surveillance on Tor usage can be obtained by operating an exit node which records traffic. The utility of this kind of collection has not been well established in the open literature but it is known that government organizations have operated Tor exit nodes for this purpose. It is speculated (and supported by evidence) that the initial documents published by WikiLeaks were extracted from traffic intercepted by malicious Tor exit nodes and later shared with Assange, potentially by Chinese intelligence, although attribution in these matters is always shaky.

The point is that alongside its privacy benefits, Tor also presents a privacy concern by intentionally introducing an on-path attacker[4]. This concern is not merely speculative but realized. The extent of the risk is probably not enormous in typical usage but is also not well understood. The history of confidentiality research is generally one of finding that metadata exposes more information than anyone had ever expected, and so it seems like we ought to err on the side of overstating the unknown risk rather than understating.

Tor provides a privacy benefit but also a privacy concern, and users of the system must weigh these against each other to inform their behavior. But the public marketing of Tor expresses very little of this. This reflects my underlying concern: marketing and public discussion of privacy services fail to express the real capabilities, limitations, and privacy benefits of these systems, which are virtually always less than said.

But then there's a whole other matter: that of countercensorship.

Remember that?! We haven't talked about it for a while. A different but closely related topic to online privacy is that of countercensorship, allowing users to access content that someone (their local segment, national government, etc) doesn't want them to access. This is actually the most widely known and discussed benefit of Tor in most contexts. And yet, it is a purpose that Tor is fundamentally unsuited for.

Tor has a property that many distributed internet systems have: in order for the Tor network to function, a good portion of Tor nodes must be aware of the existence and network identity (IP address) of a good portion of the other nodes. This property (which is shared with other popular systems like BitTorrent) means that it is fairly trivial to collect a list of all nodes participating in the Tor network. Many commercial services collect this data and offer it for a modest price.

As a result, Tor is basically ineffective as a means of countering censorship. Network operators or regimes which wish to censor internet content simply censor Tor entirely, blocking access to all known Tor nodes. This is a standard practice in the oppressive regimes for which Tor is most widely advertised.

Of course the Tor project has a solution to this problem, called Tor bridging. Tor bridges are Tor nodes which do not provide standard routing and so are not readily discoverable from the Tor system. This makes it less likely that censors are aware of them, and so if a user behind such censorship can discover Tor bridges by a side channel they can connect to them for access to the Tor network. Tor bridges generally support various ways of obfuscating the Tor protocol (which is otherwise fairly easy to identify on the network) so that censors don't block connections to bridges based on fingerprinting of the protocol.
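
To illustrate why the public node list is such a problem, and why an unlisted bridge slips past it, here is a minimal sketch of the censor's side in Python. Everything specific in it is an assumption for illustration: the addresses come from the ranges reserved for documentation, and build_blocklist and is_blocked are hypothetical names rather than any real firewall's interface. A real censor would also fingerprint the protocol, which this sketch ignores.

```python
import ipaddress


def build_blocklist(relay_addresses):
    """Turn a harvested list of relay addresses into a set for fast lookup."""
    return {ipaddress.ip_address(address) for address in relay_addresses}


def is_blocked(destination, blocklist):
    """The censor's whole decision: is the destination a known relay?"""
    return ipaddress.ip_address(destination) in blocklist


# Hypothetical relay list, as might be collected from the network itself.
known_relays = build_blocklist(["192.0.2.10", "192.0.2.11", "198.51.100.7"])

# A bridge learned through a side channel does not appear in that list.
unlisted_bridge = "203.0.113.42"
```

Here is_blocked("192.0.2.10", known_relays) is True, while is_blocked(unlisted_bridge, known_relays) is False. The bridge works only for as long as the censor has not learned its address, which is why how bridge addresses get distributed matters so much.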

The adept reader might be wondering: if a Tor bridge can be used to connect to the Tor network from behind a censored connection, couldn't it simply be used to connect to the broader internet?

Indeed, a Tor bridge is essentially a proxy or VPN (the two are almost synonymous in this kind of usage) dedicated to providing access to the Tor network. From a technical perspective, such an obfuscated and hard-to-discover node, which is cooperative in the purpose of evading censorship, could simply forward traffic to censored services on the behalf of a user without any use of the Tor protocol. In fact, this is an old and still reasonably widely used method of evading censorship, and is often as simple as a website that will retrieve another website on a user's behalf. High school students 'round the globe are using such services to access adult websites during computer lab. Obviously there are privacy implications to these services, but they can be designed in a reasonably privacy-preserving way that presents the same exposure of user data as Tor.

So, if the mission of the Tor project is at least in part to allow those under oppressive regimes to evade censorship, why doesn't it provide such relays for more general use? I propose a reason which is a bit unconventional but I believe to be ultimately true (even if it is not *consciously* the reasoning behind the decision): internet proxies and relays are highly subject to abuse by all kinds of malicious actors, for example the ones that fill out the contact form on your website fifty times a day offering "negative SEO." Restricting Tor bridges to providing access to Tor makes them unattractive for this type of use, both because Tor is very slow and because many websites block Tor exit nodes because they generate a high level of abuse.

The bottom line is this: for many users in oppressive regimes, who need to use Tor bridges, Tor isn't used as a countercensorship mechanism at all. It's used to degrade the service of the Tor bridges so that they are less useful for abuse.

The Tor bridges are what *actually* provides the ability to evade censorship. The whole Tor network on the other side of them just makes it harder to fill out a thousand contact forms per minute.

In a way this is brilliant, because it seems to allow Tor bridges to persist longer than other types of proxy/countercensorship services. Tor bridges also seem to have seen a higher level of development for obfuscation techniques than other methods, although arguably VPNs provide a better level of obfuscation merely because the same protocols used by VPN services are also often used by corporate networks.

That said, Tor does not really provide a countercensorship function, when we get down to it. All we need for countercensorship is a node, which we can access, which will cooperate on forwarding traffic on our behalf. There are difficult parts of this problem (namely developing a way for users to locate these nodes without the censor being able to locate and block them), but the Tor project does not address these difficulties. The Tor project simply recommends that users find out the addresses of bridges by other means, like in person or through messaging services or etc.

Let me restate this, because this is a pet issue of mine and I am presently in a surly mood: there is a difficult problem in countercensorship, but the Tor project does not address it. What the Tor project does address is making their conventional countercensorship mechanism less effective so that it will attract less abuse.

As before I have injected a great deal of opinion into this discussion. There are use-cases which Tor addresses which more conventional countercensorship approaches (web proxies, VPNs, etc) do not address, most significantly the case where the person evading censorship is putting themselves at personal risk while accessing a service they do not necessarily trust, and so they desire stronger protection of their identity from the services they access. Of course this is subject to all the caveats and limitations I discussed earlier, but it is something that Tor is capable of addressing that simpler methods cannot.

That said, I do not think that this case is actually that common or important. If Facebook's decision at one point to offer a Tor hidden service tells us anything, it's that people use Tor to evade censorship in order to access Facebook (this was actually their explicit stated goal in the move). These are clearly not people who are trying to obscure their identity. I mean, sure, you can in theory sign up for Facebook without divulging your identity, but participating in Facebook in such a way as to not make yourself re-identifiable would be a difficult venture requiring a fairly high skill level.

Tor is widely advertised and recommended for purposes for which it is either unsuitable or (like nearly any technical solution) of only limited utility. This will almost certainly lead users to trust the technical solution to protect them, something that it cannot really do. In turn, users in dangerous situations (say, journalists under oppressive regimes, the group to which the Tor project really advertises itself) will place themselves at risk by participating in illegal or "dissident" activity while still being identifiable.

The Tor project, in the hero banner of their website, says "Defend yourself against tracking and surveillance. Circumvent censorship." I have argued that it has limited utility for the first sentence, and is basically unsuitable for the second sentence. And yet, it is one of the most widely used solutions for both, because it has an excellent reputation developed in good part on the back of extensive corporate, government, and nonprofit funding.

I do not mean to accuse the Tor project of having ill intent. I completely believe that everyone at the Tor project is sincerely doing their best to address these real-world problems. Most people at the Tor project are no doubt completely aware of all of the problems I have raised, even things like Tor's basic unsuitability for censorship evasion, but believe that they are presenting a good trade-off to their users by providing a limited set of benefits in a highly user-friendly package. I respect and appreciate them for this effort.

However, I feel that, like commercial VPN providers, the Tor project has placed the acquisition of users and funding over the actual security of their users. Because of their desire to be user-friendly, popular, and well-funded, they make promises which they are not technically able to keep.

To summarize by somewhat rough analogy, many privacy and anonymity technologies could be compared to a handgun[5]. You can sell someone a handgun by telling them that it could defend their lives, and this is not untrue, but they are more likely to shoot themselves in the foot. And yet, all you tell them is that it will defend their lives, because you are in the business of selling handguns. Privacy advocates are in the business of selling privacy and subject to the same errors. It is (probably) not impossible to promote these products in a conscientious and utility-maximizing fashion, but it is significantly more difficult than selling them the easy way. If we are going to address the problems of online privacy and censorship, we need to learn to do it the hard way that works, not the easy way that only feels like it.

[1] I mean that I contributed to the Mixmaster project, not the steady stream of bomb threats it handles. It was a different era, back when Usenet was just dying and not quite dead.

[2] Incidentally, Ross Ulbricht was arrested at the branch of the San Francisco Public Library that I used, after having gone there because he couldn't find a seat at the next door coffee shop from which he apparently usually operated the Silk Road. I was astounded by this fact because I frequented that coffee shop for the time I lived in San Francisco and there were *never* seats available. There is something delightful to me about Ross Ulbricht, international drug kingpin, standing in the front of the tiny coffee shop frowning at the other patrons taking all the seats at his preferred drug empire command post.

[3] It is virtually impossible to actually establish the typical usage patterns of Tor hidden services. The Justice Department once claimed that 80% of Tor traffic is child pornography but I am not especially familiar with their methodology and am inherently skeptical of their opinions in this area, given the DEA and all. That said, if I were in the mood I would make a lengthy technical argument that child pornography is almost the only purpose for which Tor hidden services are actually technically suited, and everyone using them for another purpose has been had.

[4] The term "on-path attacker" is actually not especially familiar to me, but I am steering towards compliance with draft Best Current Practice RFC "Terminology, Power, and Inclusive Language in Internet-Drafts and RFCs" which suggests it as a replacement for "man in the middle" which avoids the use of gendered language. While it is a bit of adjustment for me I also appreciate that "on-path attacker" more concisely expresses the technical concept.

[5] I both own handguns and do not desire to be on the receiving end of second amendment arguments, so please don't interpret this as a condemnation of firearms, but really this only serves to emphasize the idea that privacy services are something that have the potential to do good in some situations but the potential to do harm in others, which has the effect of requiring that they be promoted with the utmost caution.
--------------------------------------------------------------------------------
>>> 2020-07-18 what is privacy

Something that is an ongoing irritation to me is the discourse and marketing around online privacy and anonymity tools. There is a great deal of misleading discussion and confusing argumentation not only in marketing and online discussions but also in the security community, where those who will let perfect be the enemy of good (namely: the security community) often make broad statements about the security properties of various technologies without actually considering the threat model... and thus make broad statements which are broadly false.

For a rare occasion I would like to try to convey something useful: a threat-centric approach to understanding online privacy and anonymity services. Then I will kvetch about technology as usual.

The biggest reason that discussions about online privacy go astray is that both "online" and "privacy" are words that seem innocuous enough but encapsulate an enormous realm of different and sometimes contradictory elements. "Privacy" means different things for different people and contexts. Let's turn to our Study Tech and approach the matter by defining words. Privacy, at least in the information assurance context, is best thought of as being the confidentiality of *something* from *someone.* That is, I feel that I have privacy when some thing, person, or group of things or people is not able to obtain some set of information about me.

The thing/person/group (subject) and information (object)---the subject and object of privacy---are highly variable depending on the context. In common situations, people expect many types of privacy. We expect that the letter carrier does not read our mail. We expect that the police are not tracking our location. Think about the internet, though: do we expect that gmail is not reading our mail? Well, it is, in a sense. Privacy can be a complicated topic in that the expectations---in terms of subject and object---vary from person to person and case to case, and yet we have a tendency to refer to the whole thing simply as "privacy." You can imagine that this makes discussions of privacy inherently confusing, and some people seem content to perpetuate the confusion because they see discussing privacy without defining the case they are considering as being a sort of argument for that case (consider when people talk about "freedom," explicitly including something like free speech in the definition of freedom, in order to make a point).

So, at fear of sounding too much like a rationalist, in order to discuss online privacy we must first define our terms. When it comes to the object of privacy, there are three significant and useful definitions:

1) Privacy of the *payload*, that is, the contents of the webpages we view, files we download, messages we send, etc.

2) Privacy of the *metadata*, that is, the IP addresses (and implicitly services/websites) that we communicate with, how much, how often.

3) Privacy of your *identity*, that is, your IP address and other identifying information.

Let's think about the first two practically. Object 1, payload, is protected by TLS (HTTPS). Object 2, metadata, is protected partially at best by TLS. While e.g. TLS with encrypted SNI can provide some protection of metadata in certain situations, it does not provide effective protection in the general case.

Let's also think about subject. This is where things get a bit more complicated:

A) Privacy from your ISP, network administrator, other users on your local network. We will call this the local segment.

B) Privacy from the broader internet, that is, transit providers. We will call this the internet segment.

C) Privacy from the operator(s) of the services you connect to, e.g., privacy of your identity from a website you view. We will call this the remote segment.

You can see that there is a certain relationship between the subjects and objects in common online privacy concerns. When we are talking about subject C, the remote segment, it is almost exclusively object 3, our own identity, that we might worry about. The website that we are connecting to will obviously know that we are connecting to them and retrieving certain data, but we might not want them to know *who* we are. And of course in many cases we don't care, because the website we're using might be one that we explicitly identify ourselves to anyway. Say, our bank.

Finally, there are some concepts that are in fact entirely separate from privacy but are still often co-mingled with privacy in discussions of technology and policy. The most prominent such concept is "countercensorship," which is the desire of users to view content that someone does not want them to. Countercensorship also varies by subject, that is, censorship is often discussed in consideration of two different subjects:

1) The local segment---that is, a user's ISP or even national government desiring to prevent them accessing certain content. The case of a national government may actually be found in the internet segment, but for practical purposes the situation is the same for countercensorship. Someone *on the connection path* is trying to prevent a user reaching content.

2) The remote segment---websites may decline to provide certain content to a user but the user might desire to access it anyways. The most common case is that of region-locking, in which say BBC iPlayer refuses to stream BBC TV series to users it believes to be outside of the UK.

There are various types of technologies and services intended to handle different combinations of all of these cases. However, the requirements of these different types of privacy and countercensorship are different and can be contradictory. This means that a service or technology which is suitable for one situation may be unsuitable for another situation.

This is the real problem that frustrates me endlessly: various services and technologies are constantly promoted for "privacy," and while they might possibly be useful for one case they are almost always entirely *unsuitable* for another case. Users do not understand this, and people hawking various services do not attempt to educate them, and so users who are concerned about their privacy are induced into making choices which, in fact, compromise their privacy, and sometimes by the well meaning. One of the groups that I consider guilty of this offense, for example, is the EFF, through their breathless promotion of the Tor project with little consideration of the utility of Tor in specific situations. Instead, it is presented as a silver bullet for both privacy and countercensorship---applications which it is not always ideal for, and is sometimes counter to the purpose.

So to start discussing online privacy within this framework, let's look at a technical (and commercial) solution which is widely advertised for user privacy: VPNs.

VPN stands for Virtual Private Network. The "Private" in this acronym is intended in a completely different context and it's best to ignore it; in the context of "the VPN" as it relates to common internet users, it is generally deceptive. From a technical perspective, a VPN is a technology which allows a network to be virtualized on top of a different network. For example, it allows a corporate internal network to be "extended" over the internet to the devices of employees who are working remotely, or for two different physical locations to have their local networks "unified" to a single network.

This is all rather boring to end users. To end users, or increasingly anyone who's ever heard of a YouTube video, a VPN is a service which makes the internet private.

Commercial VPN services such as Private Internet Access, NordVPN, ExpressVPN, the new Mozilla thing that probably no one uses, etc. are best viewed as services which take your computer and place it, logically, on the network of the operator---instead of the network you are currently connected to. This explanation, more than others, might help people to understand the privacy and security properties of such services.

So let's examine this from the perspective of subjects and objects of privacy. VPN services can easily protect objects 1 and 2 (payload and metadata) from subject A. That is, the use of a VPN service makes it so that the local network segment, the coffee shop WiFi you're on and/or your ISP, cannot view your traffic (even if unencrypted) and where it's going. They may still be able to collect certain metadata because VPNs are imperfect, for example, traffic volume, which research has found can sometimes be used on its own to derive useful information about payload. However, a VPN certainly provides stronger protection of 1 and 2 from A than not using one.

This protection of your traffic from the local segment is the primary function of a VPN. It is one of the key functions of corporate VPNs (in a client-to-site scenario) and the most significant value which can be derived from a commercial VPN service such as NordVPN, etc.

VPNs do not generally provide protection of objects 1/2 from subjects B/C, because once your traffic has departed the VPN provider network it traverses the internet the same as it would have otherwise. However, this situation is more than sufficient for most people. Privacy from subject C (a website/service operator) will always be limited by the fact that they necessarily have access to payload. Privacy from subject B (the greater internet, or transit providers) is generally of less concern to consumers because surveillance of internet transit is uncommon and generally limited to state actors. By far the greatest surveillance (and tampering) risk exists on the local segment.

VPNs may have some secondary value in protecting object 3 (your personal identity) from all three subjects. This basically occurs because most VPN providers present a large list of users as a single network identity (NAT is the technical mechanism), which entirely prevents methods like IP geolocation and also makes more sophisticated methods of identifying users somewhat more difficult because they tend to become "lost in the noise" of multiple users sharing the same address. However, the protection offered here is severely limited, and in general *should not* be a selling point of VPN services although it often is.

Protection of personal identity is most effective in the case of eavesdroppers on the link as it frustrates analysis via e.g. deep packet inspection, since numerous users will appear as having the same source IP. However, even this assertion relies on a long list of assumptions, perhaps most important of them being that your network traffic is indistinguishable (by means other than source IP) from other users of the same service coming from the same node of the same VPN service. This may sometimes hold out for very busy/popular services, but in many cases you will be the sole user from that node of that VPN provider, there are fingerprinting methods available even to DPI, etc. In general, VPNs are not really designed to protect users from DPI occurring elsewhere on the internet and cannot be expected to be effective in doing so in the general case.

Let's consider the case of protecting your personal identity (3) from website and service operators (C). This is perhaps the use-case for which VPNs are most misleadingly advertised and sought by some users. The use of a VPN service provides virtually no protection of your identity from the websites or services that you access. There are multiple reasons for this, but here are the two most compelling:

* You may willingly provide your identity to many services and websites, e.g. via logging in to an account. This negates any privacy protections. This may seem obvious, but it is somewhat baffling how frequently people use "privacy tools" to log into Facebook with the expectation that it somehow mitigates Facebook's knowledge of their identity and behavior. There may be very limited cases in which it does, but in general, there is no value to most privacy protection technologies when you use them to access a service that you provide your identity to.

* VPNs provide no mitigation against conventional fingerprinting methods, which rely on behavior of your web browser and features provided by your web browser to uniquely identify you. Because users normally roam between networks as part of normal practice, most advertising, analytics, etc. networks will seamlessly identify you between using a VPN and not using a VPN, without even any detection that anything is abnormal. In most modern surveillance contexts you are identified by fingerprinting, not by network origin, and use of a VPN does nothing to deter this.

A final consideration about VPNs is perhaps the most critical. We have seen that VPNs provide good protection in some cases (of payload and most metadata from the local segment) and limited protection in some other cases (of personal identity from the internet and website operators), although this protection is so limited that I feel it to be ethically very questionable to advertise it. And, when I say "advertise," I don't mean only in commercial advertising. I would apply this admonition equally to any number of well-intentioned "online privacy guides" and etc. that advocate the use of VPNs as a "privacy measure" without an explanation of what protection they provide---and more importantly, what protection they do *not* provide.

Many, even in the security community, will justify recommending methods of limited efficacy by the fact that they *do* provide some benefit in limited cases, and so it is better to use them than not. That may very well be true in some cases, but there is a significant hazard to end-users who are falsely confident. That is, users who believe that they are "protected" may put themselves at risk because of the assumption that they cannot be identified. This is especially true of individuals in more critical situations where such privacy mechanisms are often recommended---journalists, subjects of oppressive regimes, etc. This makes it not only a matter of technical correctness but also moral correctness to educate users as to the limitations of privacy technologies.

The whole thing is further complicated by the fact that VPN services have become a Big Business, and so there is a great deal of paid promotion. Despite a strong tide of opinion (and law) against this practice, there is still plenty of unacknowledged or "native" promotion for VPNs to be found. This creates the unfortunate situation where even well-meaning people recommending VPNs to friends and family may inadvertently be acting as an agent of a commercial promotion scheme that is motivated by profit, not by any sense of privacy, safety, or security.

The situation is perhaps the most extreme when we consider the potential risks to privacy and security from using a VPN service. Recall the technical model of the behavior of a VPN: it replaces the local segment (local area network and consumer/commercial ISP) of the user with the local segment of the VPN provider. All of the risks once to be found on the local segment are still present on the VPN service's local segment. As a user, all you have is the VPN provider's assurance that they act in your interest and apply best practice security measures to their own internal network and internet service arrangement.

Considering the low price and fast multiplication of these VPN services, it is inevitable that there are problems in this area even without any malicious intent. The majority of VPN providers today operate out of a relatively small number of low-rent colocation facilities, often simply white-labeling nodes provided by a different VPN service (you can detect this simply by observing that many VPN providers have exactly identical lists of nodes). They may have no one on staff with significant technical expertise. They may not have invested in any security program whatsoever. If they operate their own infrastructure, they may be devoid of the most fundamental secure practices, such as limitation of privileges and patch management. All of this adds up to an inevitability: commercial VPN providers will experience security incidents which compromise their users' privacy. This is true of all providers but especially of the low-end providers which offer large numbers of nodes at very low prices, often consist of just a few people without technical expertise, and problematically often market the most heavily by more questionable means (such as "native" social media campaigns, e.g. "influencers").

Look at it this way: there can be very good reasons not to trust your local network segment. Some of the US's largest ISPs have displayed decidedly anti-consumer behavior and made it clear that they have little concern for consumer privacy. However, do you trust $5 MegaFastVPN more than your own ISP? At least Comcast *has* a security program, even if its concern for consumers is questionable. Many of these VPN providers could likely be subject to malicious outside surveillance for an extended period of time without knowing. This is especially true since so many lower-end VPN providers share infrastructure which is itself obtained from low-end colo and dedi providers with severely limited security programs. Considering that these VPN providers and especially infrastructure providers to VPN providers concentrate a large amount of consumer traffic into one soft target, they become extremely attractive to malicious actors.

There was recently a significant database breach of a commercial VPN provider (which provided services to many whitebox providers), which released about 20M log entries from VPN providers which "retained no logs." In this particular case there appears to have been knowing deception involved as the data in question came from an ElasticSearch instance (you don't feed logs into ES if you don't intend to use them), but there is of course a substantial aspect of incompetence involved as the ES instance was left exposed to the internet and unsecured (a *remarkably* common mistake with ES which, by default, requires no authentication and in older versions listened on all interfaces... also a sure sign of utterly lacking basic security practices). However, it's easy for this kind of thing to happen out of pure incompetence. A great deal of network and management software retains logs by default. Truly asserting that you "retain no logs" would require a degree of technical competence and effort, and ideally an auditing program to ensure ongoing compliance, that inexpensive VPN providers do not offer.

I am not necessarily here to give advice; after all, I'm probably not licensed to give computer advice in your state[1]. The problem, though, is that it's very hard to give advice in this area for two reasons. First, the concerns and behavior of users differ, and this impacts what privacy measures they should take. Second, commercial VPN providers are quite frankly a cesspool of questionable practices and I have a hard time trusting even the most reputable. There are probably a few things we can state with some confidence; PrivateInternetAccess is probably more trustworthy than the coffee shop's open wireless network, and in fact, the scenario of untrusted (shared, open, etc) local networks is the situation in which I have an easy time recommending the use of a VPN. But the sheer number of bad actors in the VPN space makes me extremely hesitant to ever tell another human being that they ought to look into one. They are likely to misunderstand the privacy protections as stronger than they are, and even worse there's a good chance that the VPN provider could itself turn out to either be a malicious actor or compromised by one.

Here's perhaps my best advice: if you're concerned about security and privacy on your local network segment you should use a VPN that you operate yourself (not too difficult if you have a background with Linux) or use my NordVPN affiliate link^w^w^w^w^wone operated by someone you know. Just ask the nearest neckbeard and smile and nod when they start going on about WireGuard. But all of these commercial VPNs are a disaster.

I hadn't really intended to only cover the topic of VPNs in this post but I did and it's already pretty long, so let's declare a multi-parter. Join us next time to talk about some privacy and countercensorship technologies other than commercial VPNs, and why I hate those too.

[1] I've been admitted to any number of bars, but it's the bouncers that have refused me that you probably ought to ask.
--------------------------------------------------------------------------------
>>> 2020-07-15 some more formats

Let's talk about some more formats. Last time I basically left myself an agenda for the next message, so I'll do my best to adhere to it for once.

### Fixed Width Fields

Fixed width fields are a common feature of older data interchange formats. For the unfamiliar, the idea of a fixed-width field is simple: if you have, say, three fields for each record, just say that the first one is 10 characters long, the second 10 characters, and the third 10 characters. Now you just pad or truncate each field to fit. The main advantage of fixed-width fields is that they make parsing very simple because the parser just grabs the next so many characters to get each field. The downside is that it is inefficient when values are shorter (due to padding characters) and loses data when values are longer (due to truncation). As a result, fixed-width fields are generally only suitable when you have minimal variable-length data. For example, fixed-width formats can be a good fit for accounting applications when you have a strong sense of how many digits will be in the numbers you deal with and can accept needing to provide some kind of special-case handling when a number somehow turns out to be longer.
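To make the pad-or-truncate idea concrete, here is a minimal Python sketch of the three-fields-of-ten-characters example. The field values and the function names `pack` and `unpack` are made up for illustration; this is not any particular format's real layout.

```python
# Hypothetical layout: three fields of 10 characters each, as in the text.
WIDTHS = [10, 10, 10]

def pack(values, widths=WIDTHS):
    """Pad short values with spaces and truncate long ones to fit."""
    return "".join(v[:w].ljust(w) for v, w in zip(values, widths))

def unpack(record, widths=WIDTHS):
    """Slice the record at fixed offsets and strip the padding."""
    fields, pos = [], 0
    for w in widths:
        fields.append(record[pos:pos + w].rstrip())
        pos += w
    return fields

record = pack(["ALICE", "ALBUQUERQUE", "42"])
print(repr(record))    # always exactly 30 characters
print(unpack(record))  # the 11-character city comes back one letter short
```

Note that the parser never looks for a delimiter, which is the whole appeal, and that the truncated value is silently lost, which is the whole problem.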

As you can imagine, in most cases fixed-width fields turn out to be too much of a hazard (in terms of technical debt) for practical use. If you tilt your head just right, the whole Y2K fiasco was basically a result of the choice of fixed-width fields that were too short to meet future use-cases. Sure, the year field maybe *always* should have been four characters, but in 1975 two characters seemed like plenty to meet the need. Just like how eight characters ought to be enough for Unix usernames, and for filenames plus a three character extension. All of these arbitrary limits were great and fine.

And yet, fixed-width formats were quite common in earlier computer systems and still pop up today, mostly in relation to legacy systems and formats. Let's think about why.

The first reason is the punched card. Punched cards have varied in length historically, but when you say "punched card" what most people think of is 80 columns wide[1]. The 80-column card dates back to 1928(!), but is widely known today as the "FORTRAN Statement Card" because it was the format adopted by FORTRAN, which became their most popular application, and so most of these cards seen today literally say "FORTRAN Statement" on them regardless of what they were actually used for. Because FORTRAN was designed for these cards, earlier versions of FORTRAN (such as F77) imposed the restrictions of punched cards even when reading source from text files. This includes a limit of 80 characters for each line and special meanings for the first several columns---such that a FORTRAN statement always began in the 7th column. Fortran 90 relaxed these restrictions and allowed for modern use of indentation.
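As a rough illustration of those column rules, the Python sketch below splits a source line the way the conventional FORTRAN 77 fixed form is usually described: a `C` or `*` in column 1 marks a comment, columns 1-5 hold an optional statement label, a mark in column 6 flags a continuation, columns 7-72 hold the statement, and columns 73-80 are left for card sequence numbers. The function name is hypothetical and compiler-specific extensions (tabs, other comment characters) are ignored.

```python
def split_fixed_form(line):
    """Split one fixed-form source line into its column-defined parts."""
    # Treat every line as an 80-column card, padding or cutting as needed.
    card = line.rstrip("\n").ljust(80)[:80]
    if card[0] in "Cc*":
        return {"kind": "comment", "text": card[1:].strip()}
    return {
        # Column 6: anything other than blank or zero marks a continuation.
        "kind": "continuation" if card[5] not in " 0" else "statement",
        "label": card[0:5].strip(),         # columns 1-5: statement label
        "statement": card[6:72].strip(),    # columns 7-72: the statement
        "sequence": card[72:80].strip(),    # columns 73-80: sequence number
    }

print(split_fixed_form("  100 FORMAT(I5)"))
print(split_fixed_form("C this line is a comment"))
```

The meaning of each character depends only on where it sits on the card, which is exactly the fixed-width-field idea applied to source code.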

Because punched cards were a fixed width, there were already specific limits imposed on the length of fields, and so it made sense to divide them in a fixed-width fashion. In fact, the reason for doing so is less logical and more physical, because punched cards (including the 80-column variant) were first introduced for use with purely electromechanical machines which had to be designed or configured (by jumper wiring) to understand that certain columns belonged to certain positions. These mappings could not easily be changed, ruling out variable-length fields.

Fixed-width fields were widely used throughout computing of the era but were particularly important in COBOL. One of the features of COBOL was its built-in data model. COBOL essentially had a concept of data structures (somewhat like C's structs but more sophisticated) which were natively serializable to cards, tape, or files. They were natively serialized because they were already stored in memory in a simple linear format using... fixed width fields. When describing a record format a COBOL user had to provide the length of each field in characters, including numeric fields---which made plenty of sense because numbers were almost always represented in BCD at the time, so number of characters and numeric precision were the same thing.

So, in essence, a COBOL record was a string of characters, and the record format indicated which character offsets corresponded to which fields. Records were both manipulated in memory and written to cards, tape, and disks this way. Fixed-width fields remain especially prominent in fields with significant historic use of COBOL, such as the finance industry, where for example the automated clearing house (ACH) system is based on fixed-width-field text files moved around by SFTP. The use of fixed-width fields in banking computer systems is also the basic reason why the charge descriptions on your credit card statement are INSCRUTABLE ALLCAPS ABBVTD S. In addition to its use of fixed-width fields, COBOL was frequently used on systems which supported only uppercase characters (either as a limitation of the computer's code page or as a limitation of the terminals)[2], and all-caps has been remarkably long-lived as a Thing Computers Do In Bureaucracy.

Fixed-width fields are rarely used in "modern" text-based interchange formats because of their poor ergonomics and the obvious problem of determining the correct field length. That said, fixed-width fields are of course in widespread use in non-text formats including most types of binary serialization. Considering that, for many purposes, your computer never handles numbers of any length other than four bytes anyway, it makes sense to use a fixed four bytes for them.

### Field Separators

More logical to us today than fixed-width formats are formats in which fields (on each line) are separated by some type of delimiter. The idea of reserving some character that is not likely to appear in the actual data to serve as a delimiter is one with a long history. As a notable example, in his paper on the Entscheidungsproblem, Alan Turing used the schwa to mark the end of data on the hypothetical machine's tape. Besides this being a reasonably obvious idea, Turing was likely aware of pre-computer precedents as well, such as telegraph operators using a distinctive symbol to mark the end of each message. Turing actually referred to these characters as "sentinels," but today "delimiter" is the norm.

It might seem that an obvious criterion for delimiters is this: the delimiter should not normally appear inside of the field. If a field contains the delimiter which will be used to mark the end of it, it will be necessary to somehow mark the delimiter character as "but not really." Today we refer to this as "escaping" the delimiter character, although the term is somewhat confusing in this case. "Escape codes" were originally sequences that literally began with the escape character, but the term was later expanded to describe any sequence of characters which start with a certain special character and encode a meaning as a single unit. So, as an example to make this concrete, in many modern programming languages we may use single quotes (') to delimit the start and end of literal strings. A literal string may sometimes contain a single quote, so we have to "escape" that single quote, except that instead of the escape character we use the backslash. \' is a special sequence, identified by starting with a backslash, that means "this encodes a ' but is not a delimiter." I like to avoid the term "escaping" in reference to a delimiter because this kind of use of escape sequences is actually a special case, escape sequences are a much more general concept, and so it's slightly confusing to learners (although very common) to use the terms "escape sequence," "escape character," "escape code," "escaping," etc. to refer to all of these things that are not obviously related.

You can see that this whole thing about escaping is kind of a hassle, so we want to ideally eliminate it but at least minimize it. That means selecting a delimiter that never or rarely occurs in the data. ASCII provided a convenient mechanism for this: the first 32 ASCII characters are "control codes," and as many as a dozen of these (depending on definitions) are dedicated to marking the start and end of things. So this appears to be an open and shut problem: we need special characters to delimit things and there they are. But, in practice, these control characters are very rarely used. There are a number of reasons for this, but the most obvious and realistically most significant are simply poor ergonomics. There is no button on the keyboard for ASCII 1f "Unit Separator," and no one [who speaks English] likes to use characters that aren't on their keyboards. Further, there is no well-accepted convention for displaying these characters. They basically rule out any sane hand-editing of data.
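
For the curious, this is roughly what using those control codes would look like. It is only a sketch in Python, assuming ASCII 1F (Unit Separator) between fields and ASCII 1E (Record Separator) between records; the helper names and sample rows are invented.

```python
UNIT_SEP = "\x1f"    # ASCII 1F, "Unit Separator": goes between fields
RECORD_SEP = "\x1e"  # ASCII 1E, "Record Separator": goes between records

def encode(rows):
    """Join fields and records with control characters; no escaping needed."""
    return RECORD_SEP.join(UNIT_SEP.join(fields) for fields in rows)

def decode(data):
    """Split back into rows of fields."""
    return [record.split(UNIT_SEP) for record in data.split(RECORD_SEP)]

# Commas, pipes, tabs, and even newlines can sit inside fields untouched.
rows = [["Smith, John", "a|b"], ["two\nlines", "tab\there"]]
encoded = encode(rows)
print(repr(encoded))
print(decode(encoded) == rows)
```

The round trip works without any escaping at all, but try typing that file by hand, or reading it in a text editor, and the ergonomic problem is obvious.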

So, instead, "printable" characters (meaning ones on your keyboard) are generally used as delimiters. This presents a problem since the characters on the keyboard are all reasonably likely to appear inside of data. Early on, it was very common to select characters like |, \, `, and ~ as delimiters because they are rarely used in text, and really only exist in ASCII and on standard keyboards by happenstance. The | was particularly popular because it resembles a dividing line and was already used on typewriters to make vertical rules in tables. In general, it was an obvious and relatively ideal choice for a field delimiter. Today pipes are still often used as field separators in certain log formats, especially in the POSIX world.

But, of course, far more common than the pipe in modern usage is the comma. Comma as delimiter is so common that the conventional term for it, comma separated values or CSV, has become basically synonymous with tabular data in plain-text form. Comma has its upsides in that it's a familiar character and already has a related semantic meaning in natural language where it's used to punctuate lists, but has the serious disadvantage that it commonly occurs in text. Meaning that your comma-separated fields may have commas in them. These commas then need to be escaped.

Wikipedia, which as previously mentioned is never wrong, tells us that the use of a comma as a field delimiter was present in an IBM FORTRAN compiler in 1972. Further, some additional research suggests that FORTRAN (including FORTRAN 77 in which this feature was standardized) is also the source of the maddening "quoting" semantics that exist with CSV. That is, when I talked about escaping delimiters using escape sequences, I was describing a "modern" approach to the problem. CSV typically takes a different approach called "quoting," in which fields that contain the delimiter must be surrounded by quotes. The quotes do *not* demarcate the fields, though, only allow sections to contain the delimiter. This leads to some truly insane situations, where for example the string ,"", in a CSV field must be "quoted" as ","""",". \" isn't exactly ergonomic but """ manages to be worse.
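
If you don't believe me about the quotes, Python's standard csv module will happily demonstrate. This sketch uses the module's default dialect, which is one common interpretation of CSV and not the only one; the surrounding "a" and "b" fields are just there for context.

```python
import csv
import io

value = ',"",'  # a field containing two commas and two quote marks

# Write one row using the csv module's default ("excel") dialect.
buffer = io.StringIO()
csv.writer(buffer).writerow(["a", value, "b"])
line = buffer.getvalue()
print(line)

# Reading the line back recovers the original three fields.
parsed = next(csv.reader(io.StringIO(line)))
print(parsed)
```

The plain fields are written bare, while the awkward one is wrapped in quotes with each of its own quote marks doubled.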

Comma delimiters were a massive mistake, and attempts to formally standardize the format (e.g. various RFCs) generally only serve to illustrate how poorly defined "the CSV format" is, being basically a loose description of the nonstandard behavior of an early-'70s FORTRAN compiler. That said, the format became popular in business applications (because IBM used it) and was a natural "lowest common denominator" format for spreadsheet tools, so we are now pretty well stuck with it. Unfortunately, the term "CSV" is used so carelessly to describe so many things that it often requires careful handling. For example, when you open a CSV file in Excel it prompts the user to choose all kinds of parameters for how the file will be parsed. This is of course extremely user friendly.

Another once-common choice of "user-friendly" delimiter that has fallen out of popularity is tab. Files which use the tab as a delimiter are sometimes called tab-separated values or TSV. TSV has the advantage that the tab character is unlikely to appear in a field, but it loops directly back to a disadvantage of the dedicated ASCII field separator character and somehow makes it worse. Tabs are not just non-printable characters, they are characters that induce context-specific behavior in the printer (typically, snapping to the next 8-character interval). This means that TSV files as printed or viewed in editors (unless the printer/editor handling of tab is modified) look extremely wacky and are usually even harder to look at than CSV files.

The point here is that encoding structured data in text brings up a very fundamental problem: structured data tends to *include* text, so that there is no clear delineation between symbols that encode the structure of data and the symbols that are the actual data. This inner conflict means that virtually all text-based encoding standards require some kind of escaping or quoting convention. This sometimes gets complex. For example, in HTML, there is both a symbol other than > that encodes > *and* a way to demarcate a section of text where the actual > is to be interpreted as not being part of the markup. Naturally this way of marking a section to not be interpreted as structure must itself not be interpreted as structure but also must have an end delimiter which will not occur in the non-structure data (so that it can represent structure in the data that cannot represent structure), and so has a syntax that is completely mind-numbing: <![CDATA[ ... ]]>.
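
The first of those two HTML mechanisms, the stand-in symbol, is the &gt; entity and its siblings. A small sketch using Python's standard html module shows the substitution and its reversal; the sample text is made up.

```python
import html

text = "if a > b && b < c"

# Replace the characters that HTML treats as structure with entity stand-ins.
escaped = html.escape(text)
print(escaped)

# Unescaping recovers the original text exactly.
print(html.unescape(escaped) == text)
```

Note that the ampersand, having been drafted to introduce the stand-ins, now needs a stand-in of its own.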

### A bit about control characters

Having said all that about delimiters, let's talk a little bit about those first 32 ASCII characters. They are not all completely unused. For example, null or the 00 byte is in the ASCII character set as, well, the NUL character. In null-terminated strings we tend not to think of the null as part of the string (and thus we don't think of it as "text"), but the ASCII coding allows us to view it as a part of the string if we want to.

Carriage return and line feed are also widely used to represent a new line (of course LF on Linux and CRLF on Windows, for historic reasons and just to inspire hate in us all). Backspace and EOF (end of file) are also used for their intended purposes in certain cases, but not really all that often---backspace over certain types of terminal connections and the EOF character is mostly only important on Windows as Unix chose a different architectural approach to handling the end of files.

But, more interesting, let's talk about that escape character. It is a long-running convention in both printer and video terminals to recognize special sequences beginning with the escape character as "control sequences" which modify the behavior of the terminal. I am not sure where this originates, but DEC's first video terminal, the VT05 in 1970, behaved this way. IBM terminals of the same time period did not include "escape sequences" of the same fashion only because IBM took a radically different approach to video terminal interfacing which was not hampered by being a re-purposed telegraph and so provided a much more flexible way for the computer to communicate with the terminal. In general, IBM never really bought in to the basic "text in/text out" approach to terminals which was adopted by the mid- and minicomputer vendors primarily as a cost-saving measure, which is one of the fundamental philosophical divides between "big iron" and mid/minicomputing (e.g. "modern computers").

When these escape sequences were standardized by ANSI, they avoided collision with existing proprietary escape sequences by having all of their escape sequences start with the sequence ESC[. Not unrelatedly, the ESC character is conventionally represented in "control character" format as ^[, leading to ANSI sequences sometimes being represented in printable characters as ^[ or even ^[[. You may have seen these representations when you use the arrow keys on a terminal connected to a computer or software which, for whatever reason, does not understand the escape sequences and so echoes them back as entered text. There are no ASCII characters for the arrow keys, remember, and so your terminal has to encode them in terms of ASCII characters using escape sequences. This all goes to highlight how much of a problem non-text-in-text and text-in-non-text and non-text-in-text-in-non-text gets to be.
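
Here is a rough Python sketch of where that ^[[ comes from. The arrow key sequences shown are the ones terminals commonly send in their default mode (some modes send different ones), and the caret_notation helper is my own name for the usual "^ plus a letter" way of displaying control characters.

```python
ESC = "\x1b"     # ASCII 1B, the escape character
CSI = ESC + "["  # the ESC[ prefix used by the ANSI sequences

# What a terminal commonly sends for each arrow key in its default mode.
ARROW_KEYS = {
    "up": CSI + "A",
    "down": CSI + "B",
    "right": CSI + "C",
    "left": CSI + "D",
}

def caret_notation(text):
    """Show control characters (00 through 1F) as ^ plus the character 0x40 above."""
    return "".join(
        "^" + chr(ord(char) + 0x40) if ord(char) < 0x20 else char
        for char in text
    )

for name, sequence in ARROW_KEYS.items():
    print(name, caret_notation(sequence))
```

ESC is 1B, and 1B plus 40 is 5B, which is the [ character, so the escape character prints as ^[ and the up arrow comes out as ^[[A.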

To loop back around to relevancy, this is exactly the problem that markup languages face: they are used to annotate text, using the exact same symbols that constitute the text. To paraphrase von Neumann, anyone doing so is, of course, in a state of sin.

[1] If 80 columns seems familiar, yes, through a few steps of indirection not really dependent on FORTRAN, these cards are the reason that 80 characters is considered a standard width for terminals. More specifically, "interactive terminals" such as TTYs and video terminals are more or less based on "keypunches" which punched the holes in these cards, some of the later of which had 80 character wide displays on which they showed the entered data (the displays were simply easier and more ergonomic to read than the typed track on the card) and led more or less directly to the invention of the video terminal. As a fascinating bit of design history, one early IBM "video terminal" used an 80x4 character CRT display, on top of which sat two angled mirrors, so that each of two separate operators saw an 80x2 field which allowed them to see the "card" they were entering and a status line. CRTs were very expensive at the time, so this shared-tube design simply saved money. The allowance of a second line per operator for "status" is perhaps the inspiration for most later video terminals having some provision for a "status line" at the bottom of the screen. Both 80x24 and 80x25 are considered "conventional" terminal sizes because several popular terminals were 80x25 and allowed the bottom status line to be toggled on or off.

[2] If this seems a little crazy, keep in mind that early terminals and printers were electromechanical. Supporting only uppercase characters reduced the size of the type mechanism and the number of characters (and thus number of bits required to code for the characters), which could be a significant reduction in both the price and size of these devices. Further, early computer terminals were often modified teletypewriters (TTYs) which had used the Baudot encoding, which includes only uppercase characters for the same reason, as well as to increase baud rate since symbols only needed to be five bits.
--------------------------------------------------------------------------------
>>> 2020-07-11 some formats

We've talked a little bit about markup languages. Broadly speaking, and to use a taxonomy which I completely made up by myself, most markup languages in use for data interchange today are either enclosure-style, in which each element is enclosed by start and stop delimiters (e.g. HTML, XML), or key-value style, in which the file consists more or less of a list of keys and values which may be enclosed in various ways to indicate structures like maps and lists (e.g. YAML and JSON). Of course there are many others as well and I'm speaking only of data interchange here, not more general markup, but the point stands that these two families are mostly what we use today when we need to get structured data from one thing to another.

Just trying to organize things this way brings us to a somewhat complex question: what exactly is a markup language? My carefully constructed (in about thirty seconds while slightly inebriated) taxonomy happens to exclude, for example, markdown and RST, which would generally be called markup languages. This is partially because I'm focusing only on the things that are interesting to me in this case, but it's also partially because the concepts of a markup language and a data interchange format are somewhat loosely defined.

Wikipedia, which is never wrong, says that "a markup language is a system for annotating a document in a way that is syntactically distinguishable from the text." This definition, on a plain reading, clearly includes HTML, Markdown, RST, and many others. Things get a little weird when we look at XML. It has Markup Language right in the name, and it can certainly be used in a fashion similar to HTML (see: the last post), but it often isn't. In cases like XML, and even more so with YAML, the argument that the markup is just an annotation on the text becomes a lot harder to defend. I would be tempted to refer to these as "data interchange formats" rather than "markup languages," but that term is already in use for something different. We could also call them "serialization formats" but people tend to associate that term more with binary formats. So the basic terminology is rather confusing here, and if I had a bit of common sense that's what I'd be trying to taxonomize.

The point of all of this is that I would like to talk a bit about formats which are used for interchanging data between different systems (or occasionally for storing and retrieving data within the same system). These are often called markup languages but are probably not really markup languages in that they do not focus on annotating (or marking up) text, instead they express data structures which may contain text but are not necessarily text documents. These are "markup?" languages like XML, YAML, JSON (this one doesn't call itself a markup language!), and various others. And specifically, I am talking about the ones that are text-based, as opposed to binary formats like protobuf and others.

It's very interesting to me to look at the history of how we got to our modern concept of data interchange formats. There is a surprising amount of homogeneity in most modern software. XML is very widely used but decidedly out of vogue with today's youths. JSON is perhaps the most widespread because it is (kind of) easy to use and (kind of) natively supported by JavaScript, but there are a surprising number of caveats to both of those. YAML is also quite common but surprisingly complex, and it has an uneasy relationship with JSON wherein JSON documents are also valid YAML documents but you should probably forget that. There are some upstarts like TOML and something called HOCON? But no one really cares.

As mentioned previously, XML dates back to roughly 1998. YAML came about in 2001, not that much later, but became popular probably more around the mid to late 2000s when it was viewed as the antidote to XML's significant complexity. Most people don't realize that YAML is probably just as complex, because it looks very simple in the minimal examples that most people constrain themselves to.

XML has SGML as an antecedent, and SGML is derived from IBM formats which date back to 1970 or so. Interestingly, this ancient ancestor of XML (called GML, because it was before Standard GML), has a certain superficial resemblance to YAML, at least in that it involves significant use of colons. That's a bit interesting as YAML does not have any clearly described ancestors.

So how does GML work? Well, it worked much like SGML in having start and end tags, but tags were started with a colon and ended with a period, rather than using the greater than/less than symbols. But GML also had a very strong sense of being line-oriented, that is that tags generally went on their own line, which is a bit more similar to YAML than to SGML.

In fact, the great bulk of early data interchange formats were line-oriented. There are various reasons for this, chief among them that it is simply intuitive to put "one record per line," as it matches conventional tabular formats that we're familiar with in print (e.g. tables). It was also essentially a technical constraint of punched-card based computer systems, where "line" and "file" (in the modern sense) were more or less equivalent to "card" and "stack" when working with punched cards---that is, each card was considered a line of text. That each card could be called a "record" and a set of records made up a file shows the degree to which electromechanical punched card systems, and the computers derived from them, were intended to model pre-computer business records kept as lines in ledgers.

Overall I have found it extremely difficult to trace any kind of coherent history of these formats, which is probably reflected in how disorganized this message is. Many old data interchange formats have familial resemblances to each other, giving the tantalizing suggestion that a "family tree" could be traced of which were based on which others, but actually doing this would probably require a great deal of original research and I have both a full-time job and hours of standing in the living room staring at the wall to keep up with, so while I have made some tentative forays into the matter I do not expect to publish a treatise on the origins of XML any time soon.

Instead, I would like to mention just a few interesting old data interchange formats and some things we can learn from them. Most of these examples are old, and all of them come from a context in which a body of experts attempted to design a single, unified data model sufficient to meet all the needs of a given problem domain. This has profound implications. I have said before and I will say again that computer science is the discipline principally concerned with assigning numbers to things. In the realm of computer science (and specifically AI, in the original meaning of AI, not the marketing buzzword of today) research, the term "ontology" is borrowed from philosophy to refer to defining the nature of things. That is, ontologists in CS do not seek to establish what *is*, they seek to *represent* what is. This is perhaps the highest-level academic discipline of assigning numbers to things and deals with fundamental and theoretical questions about how computer systems can represent and manipulate complex domains of knowledge. While the ontologists of philosophy ponder what does and can exist, the ontologists of computer science ponder how to punch all of that onto paper cards.

XML is not exactly a masterpiece of ontology, but there is a whiff of ontology throughout the world of data interchange formats. Designing a domain-specific interchange format requires considering all of the areas of knowledge in that domain and assigning codes and keywords to them. Designing generalized interchange formats requires considering all of the *structures* of knowledge that need to be expressed. Because the set of data structures in use by computer systems is in practice highly constrained by both the limits of technology and the limits of the people who use the technology (essentially everything in life is either a map or a list, regardless of what your professors told you about bicycles and inheritance), it seems that in practice creating a generalized markup language is almost the easier of the two efforts. At least JSON is really dead simple. Of course, for generalized languages which support schemas, schemas tend to bring in domain-specific knowledge and all the complexities thereof.

So let's forget about generalized markup languages for now and jump back to a time in which generalized markup languages were not in widespread use and most software systems exchanged data in domain-specific formats. These domain-specific formats were often being developed by domain experts using very careful consideration of everything which may need to be represented. We see in this pursuit both complex theoretical problems in computer science and the ways in which large parts of computer science (generally the more applied assigning of numbers) are derived from information or library science.

That was an extremely long preamble to get to the actual point of this message, but hopefully it provides a bit of context into why I am about to tell you about MARC.

If I am to argue that we can blame large parts of computer science on library science, MARC is my key piece of evidence. Librarians and other information science types are deeply concerned with the topic of "authority control," which is basically about being able to uniquely identify and look up information based on standardized names. A book ought to have one title and one author (or set of authors) which can consistently be used to look it up, even though people are prone to use abbreviations and write names in different ways. A similar problem is seen in genealogy where the spelling of family names often drifts from generation to generation, but researchers tend to consider "McLeod" and "MacLeod" to be the same name despite the variable spelling. You could argue that when Google corrects your spelling errors it is practicing a form of authority control by standardizing your query to the authorized vocabulary.

Yes, authority control tends to be based around the idea of establishing a restricted vocabulary of standardized, or authorized, names. J. R. R. Tolkien, John Ronald Reuel Tolkien, and my insistence on misspelling it J. R. R. Tolkein ought to all be standardized to the same authorized name, so that a query for any of these representations returns all of his books. "Tolkien, J. R. R." according to the library catalog. This idea of a standardized, constrained vocabulary will be familiar to anyone in computing as it's the same kind of thing we have to think about when dealing with computers. MARC rests at exactly the intersection of the two.

MARC is short for Machine-Readable Cataloging. It was developed for the Library of Congress in the 1960s for the purpose of representing the library catalog in computer form. It is still in fairly common use today as a "lowest common denominator" interchange format between library cataloging software developed by different vendors. While there is an XML variant today, MARC is most widely seen in its original, 1960s format that looks like this:

005    20180917152453.0
008    180410b ||||| |||| 00| 0 eng d
020 _c EC$20.00 (cased).
100 _a Tolkien, J.R.R.
245 _a The silmarillion /
    _c J.R.R. Tolkien ; edited by Christopher Tolkien.
260 _a London :
    _b Book Club Associates,
    _c c1977.
300 _a 365 p. ;
    _c 23 cm.
500 _a Includes index.
650 _a Baggins, Bilbo
    _v Fiction.
650 _a Middle Earth (Imaginary place)
    _v Fiction.
    _9 36397

Of course, this is not exactly what it looks like. This is in part because I have omitted certain fields to make it more readable, but it's more so because the standard representation of MARC makes use of non-printable ASCII control characters to separate fields, and not the newline. I have swapped out these control characters for newlines and spaces and then indented to make things more clear. I have also omitted some junk that comes out of the details of the format such as a bunch of extra slashes. The point is that I have made this format look tremendously more human-friendly than it actually is.

MARC consists of fields, each identified by a three-digit number. Fields may have subfields, identified by a letter. For example, field 245 is Title Statement. Subfield A is Title, subfield C is "statement of responsibility, etc." according to the LoC documentation. Not all of these fields make so much sense. Field 008 is called "fixed-length data elements" and is part of the control fields (00x fields). It contains things like date the book was added to the catalog, where the catalog data came from, but also some less control-ey data like "target audience." But all of this is combined into one field using a fixed-width format, and the pipe is for some reason used as a "fill" character for fields which are required but have no data.
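
To give a feel for the un-prettified form, here is a sketch in Python that picks apart the raw bytes of a single data field, using the 245 field from the record above. In the real encoding each subfield starts with ASCII 1F followed by its one-letter code, and the field ends with ASCII 1E. The two blank "indicator" characters at the front are placeholders of mine, and the sketch skips the leader and directory that a complete record carries.

```python
SUBFIELD_DELIM = "\x1f"  # ASCII 1F precedes each subfield code
FIELD_TERM = "\x1e"      # ASCII 1E ends each field

def parse_data_field(raw):
    """Split one raw data field into its indicators and (code, value) subfields."""
    body = raw.rstrip(FIELD_TERM)
    indicators, *chunks = body.split(SUBFIELD_DELIM)
    subfields = [(chunk[0], chunk[1:]) for chunk in chunks if chunk]
    return indicators, subfields

# Field 245 from the example record, with blank placeholder indicators.
raw_245 = (
    "  "
    "\x1faThe silmarillion /"
    "\x1fcJ.R.R. Tolkien ; edited by Christopher Tolkien."
    "\x1e"
)

indicators, subfields = parse_data_field(raw_245)
for code, value in subfields:
    print(code, value)
```

The subfield code is simply the first character after each delimiter, which is why the display form above shows it as _a and _c.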

This idea of enumerating every field that might need to be expressed and then assigning numerical codes to them is a common aspect of early data interchange formats. I will show one other example before ending this rather long message and leaving more for later. That's a 1980s-vintage format that I have the pleasure of dealing with in my current day job, called Health Level 7 or HL7. HL7 serves as a "lowest common denominator" format for exchange of data between different electronic health record systems. An example HL7 record, courtesy of Wikipedia, follows, but note that I have removed some fields for brevity.

MSH|^~\&|MegaReg|XYZHospC|SuperOE|XYZImgCtr|20060529090131-0500||ADT^A01^ADT_A01|01052901|P|2.5
EVN||200605290901||||200605290900
PID|||56782445^^^UAReg^PI||KLEINSAMPLE^BARRY^Q^JR||19620910|M||2028-9^^HL70005^RA99113^^XYZ|260 GOODWIN CREST DRIVE^^BIRMINGHAM^AL^35209^^M~NICKELL’S PICKLES^10000 W 100TH AVE^BIRMINGHAM^AL^35200^^O|||||||0105I30001^^^99DEF^AN
OBX|1|NM|^Body Height||1.80|m^Meter^ISO+|||||F
OBX|2|NM|^Body Weight||79|kg^Kilogram^ISO+|||||F
AL1|1||^ASPIRIN
DG1|1||786.50^CHEST PAIN, UNSPECIFIED^I9|||A

If we can stop chuckling at "Nickell's Pickles," we can see that this looks very different from MARC but there is a similar phenomenon going on. Each line is a field with components separated by pipes. The first component is a three-character (but now alphanumeric) field ID. MSH identifies message type, PID is patient identity. Each of these is separated into many subfields; in the case of PID we can make out an ID number, a name, date of birth, etc. Once again, the same basic concept of code-identified fields with various subfields, and once again represented as one field per line. This time, mercifully, the field separator is newline and the subfield separator is pipe. These are conveniently human-readable so I have not had to replace them with whitespace. Finally, we once again have the use of odd filler symbols, mainly ^.

^ needs to be used basically because of a limitation in the data model: there is no way to separate "subsubfields." Consider the address. "260 GOODWIN CREST DRIVE" has a space in it, spaces are quite acceptable. But the EHR in use, like most software, feels the need to separate components of the address into tidy fields. Space can't be used to separate subsubfields because it's used within the subfields. Newline can't be used because it's the field separator. So instead, ^ is used. Further, both ^ and ^^ are used to represent subsubfield separations of different orders. "BIRMINGHAM^AL" is essentially equivalent to "BIRMINGHAM AL" except that the use of ^ rather than space assures the parser that it is the separator between city and state, not a space within the name of the city. Humans are largely smart enough to figure out that there is probably no city called "Birmingham Al" and so the "AL" must be a state, but computers are not.
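
The nesting is easy to see if you split one of the lines above in two passes, first on pipe and then on ^. This Python sketch uses my own loose vocabulary of fields, subfields, and subsubfields rather than HL7's official terms, and it ignores the other special characters declared at the start of the MSH line (~, \, and &), so it is nowhere near a real HL7 parser.

```python
def parse_line(line):
    """Split one line on pipes, then split each piece on ^."""
    field_id, *subfields = line.split("|")
    return field_id, [subfield.split("^") for subfield in subfields]

field_id, subfields = parse_line("OBX|1|NM|^Body Height||1.80|m^Meter^ISO+|||||F")

print(field_id)      # the three-character field ID
print(subfields[2])  # an empty first piece, then the description
print(subfields[5])  # the unit, its name, and the coding system
```

Empty strings fall out wherever two delimiters sit next to each other, which is how the format says "nothing here" without losing track of position.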

Alright, I'm going to try to stop talking now. But I want to follow up in a future post by going on at length about fixed-width fields and their long heritage, and also perhaps about the pipe as a field separator, which is something that's very widely seen in early (say pre-1995) formats but rarely seen today. That will bring me to the matter of the comma as a field separator, something that is in fact very common today and has turned out to be a monumental pain. Finally, I'll loop back to those ASCII control characters that MARC used and I removed for you, and wonder why no one uses them today.
--------------------------------------------------------------------------------