simple generalized message

2020-06-27

As the local news warned us in the early 2000s, the internet is a scary place full of hidden dangers. One of these is HTML.

Let's begin this discussion of the internet's favorite markup languages with just a quick bit about XML. XML, or Xtensible Markup Language, is a complicated markup language which is highly popular with enterprise software and Microsoft. More seriously, XML was introduced in the mid-'90s as a highly standardized markup language which could be used for a wide variety of different purposes while still being amenable to consistent parsing and validation. This was achieved by making XML "extensible" in the sense that multiple schemas and document type definitions (DTDs) can be used and combined to allow XML to express nearly anything---while still being conformant to a well-defined, standard schema. But we're not here to talk about XML.

Conforming to a well-defined, standard schema is a lot of work and not very fun, so naturally XML has fallen out of fashion. First, the community favored YAML over XML. YAML is a markup language which appears, at first glance, to be very simple, but as soon as one looks beneath the surface they discover a horrifying Minotaur's labyrinth of complex behavior and security vulnerabilities in the making. Partially in response to this problem but mostly in response to the community losing interest in every development target that isn't Google Chrome, YAML itself has largely fallen out of favor and been replaced by JSON, except for all of the places where it hasn't. Also there is TOML. But we're not here to talk about markup languages either.

We're going to talk about HTML.

Computers Are Bad pop quiz: do you, in your heart of hearts, believe that HTML is a form of XML?

If you answered yes, you are wrong. But, you are wrong in a very common way, which seems to be rather influential. That is what we're here to talk about.

It's actually kind of clear on the face of it that HTML is not derived from XML. The first XML specification was published in 1998; depending on how you look at it HTML was first in use somewhere between 1990 and 1995. In fact, both HTML and XML are derived from a now largely-forgotten standard called SGML, or Standard Generalized Markup Language, which traces its history back several decades before HTML or XML. HTML and XML have a familial resemblance because they are siblings, not parent and child.

This has some interesting implications. To really get at them, we need to look a little bit at SGML. The following is a valid SGML snippet:

<object>
	<Item>one</>
	<Item>two</>
</object>

The following is a valid HTML4 snippet:

<ul>
	<li>Item one
	<li>Item two
</ul>

These look very similar but---HOLD ON A MOMENT---the SGML version has some weird </> business going on, and in the HTML version the li (list item) elements are just dangling with nothing on the other side! I am exaggerating as to the level of shock here, but my impression is that a lot of people with a moderate to even professional level understanding of HTML would be surprised that this is valid.

When XML was designed based on SGML, one of the explicit goals was actually to make the language simpler and easier to parse. This might be a surprise to anyone who has ever interacted with an XML parser. But the reality is that XML is easier to parse because it has a much stricter definition that makes XML documents more consistent from one to the next. One of these strict rules is that, in XML, all elements must be explicitly closed. This is a new rule introduced by XML: in SGML, there is not only a compressed syntax to close an element (</>) but closing elements is often optional. No closing tag is required at all if the parser can infer that the element must have closed from the context (specifically when a new element starts which cannot be nested in the prior).
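To make the inference rule concrete, here is a minimal Python sketch of a parser that supplies the closing tags an author left out. It is an illustration under stated assumptions, not how any real browser does it: the IMPLIED_CLOSE table, the ImpliedCloseParser class, and the parse helper are all hypothetical names invented for this example, and the table covers only a tiny slice of what the real specifications define. The tokenizing is borrowed from Python's standard html.parser module.

```python
from html.parser import HTMLParser

# Hypothetical, heavily simplified rule table: if the innermost open element
# is the key, starting any tag in the value set closes it implicitly.
IMPLIED_CLOSE = {
    "li": {"li"},
    "p": {"p", "ul", "ol"},
}


class ImpliedCloseParser(HTMLParser):
    """Records start/end events, inferring end tags the author omitted."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open elements, innermost last
        self.events = []  # ("start" | "end" | "implied-end", tag)

    def handle_starttag(self, tag, attrs):
        # Close open elements that cannot contain the new one.
        while self.stack and tag in IMPLIED_CLOSE.get(self.stack[-1], set()):
            self.events.append(("implied-end", self.stack.pop()))
        self.stack.append(tag)
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag; this sketch just ignores it
        # Everything opened inside this element is closed by inference.
        while self.stack[-1] != tag:
            self.events.append(("implied-end", self.stack.pop()))
        self.stack.pop()
        self.events.append(("end", tag))


def parse(markup):
    parser = ImpliedCloseParser()
    parser.feed(markup)
    parser.close()
    return parser.events


events = parse("<ul><li>Item one<li>Item two</ul>")
for kind, tag in events:
    print(kind, tag)
```

Fed the HTML4 list from earlier, the sketch reports an inferred end for each li: the first when the second li starts, the second when the ul closes. Void elements like br, and the many other optional-tag rules, are deliberately left out, which hints at how much context a real parser has to carry.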

We do something pretty similar in English. When we tell stories, we typically omit stating that we stopped doing something because most of the time that can be inferred from the fact that we started doing something else. This works well because humans have an especially sophisticated ability to interpret natural language using our understanding of the world. Computers do not understand that the world exists, so enforcing very strict rules on the construction of languages makes it easier for computers to understand them. This would all be basic knowledge to anyone with a CS degree and/or who has heard of Noam Chomsky as a linguist rather than as a socialist, but it's still pretty interesting to think about. As a general rule, the better a language is for a computer, the worse it is for humans!

So XML made the decision to be annoying to humans (by requiring that you explicitly state many things that could be inferred) in order to make parsers simpler. HTML, being derived from SGML instead, requires that parsers be more sophisticated by allowing authors to elide many details.

Perhaps you can imagine where this goes wrong. In fact, for various reasons that range from loose specifications to simple lack of care, HTML parsers were both extremely complex and extremely inconsistent. This reached a peak in the late 2000s as many webpages either only worked properly in certain web browsers or had to include significant markup dedicated to making single web browsers function properly. While there was some degree of blame all around, Microsoft's Internet Explorer was the main villain both because its developers had a habit of introducing bizarre non-standard features and because Microsoft is fundamentally hateable. Because of MSIE's large market share, the de facto situation was that many webpages functioned properly in MSIE but not in, say, Netscape Navigator, err, uhh, Firefox, even though Firefox was the browser that did a better job of adhering to the written standards.

This situation led to a fairly serious backlash in the web community. While some things of real import happened like an EU antitrust case, more significantly, it became fashionable to declare in the footer of websites that they were Standards Compliant. Yes, admit it, we are all guilty here.

But something else rather interesting happened, and that's XHTML. In the late '90s, work started on a new variant of HTML which would actually be based on XML, and not on SGML. This had the advantage that XML parsers were simpler, and so web browser HTML parsers could be simpler, more consistent, and have better and more consistent handling of errors. At the time, essentially no one cared, but as the browser wars escalated a more consistent specification for HTML, which was more amenable to exact parsing and machine validation, started to look extremely tempting.
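The difference in strictness is easy to see with an off-the-shelf XML parser. The following is a small sketch using Python's standard xml.etree.ElementTree module; the is_well_formed helper and the two constant names are made up for this illustration. It feeds the parser the HTML4-style list with omitted closing tags and then the same list with every element explicitly closed, XHTML-style.

```python
import xml.etree.ElementTree as ET

# The HTML4-style list, with the optional </li> tags left out.
HTML4_STYLE = "<ul><li>Item one<li>Item two</ul>"
# The same list with every element explicitly closed, as XML requires.
XHTML_STYLE = "<ul><li>Item one</li><li>Item two</li></ul>"


def is_well_formed(markup):
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(HTML4_STYLE))
print(is_well_formed(XHTML_STYLE))

# When the markup is accepted, the resulting tree is unambiguous.
items = [li.text for li in ET.fromstring(XHTML_STYLE)]
print(items)
```

The XML parser does not try to guess what the author meant: the unclosed version is rejected outright with a ParseError, while the explicit version parses into a ul with two li children. That all-or-nothing behavior is what made XML-based HTML look attractive for consistent error handling.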

Further adding to XHTML's popularity, the same time period was a high point in interest in the "semantic web." Because XHTML is Xtensible, arbitrary XML schemas could be embedded in XHTML documents to semantically express structured data for machine consumption, along with presentation logic for display to humans. This is the kind of thing that sounds extremely cool and futuristic and no one actually cares about. The Semantic Web was much discussed but little implemented until Google and Facebook started imposing markup standards which were significantly less elegant but required for good search rankings and/or native social media traffic, and so many SEO consultants transitioned from adding paragraphs of invisible text in the footer to adding weird meta tags to the header in order to look better in the Facebook feed. Now that is computing technology.

Most people who learned HTML in the 2005-2015 time period actually learned XHTML, and may not realize it. That's why, today, they strictly close all of their elements, including the empty ones.

This whole thing is made sort of funny by the fact that XHTML was rather short-lived. The release of the HTML5 specification in 2014 addressed most of the shortcomings of the HTML 4.01 specification, and obsoleted XHTML. Part of this is because HTML5 was the shiny new thing, part of it is because HTML5 largely integrated the features of XHTML in a more convenient fashion than XHTML, and part of it is because XML was very popular with Microsoft, which is extremely hateable.

In the end, XHTML is essentially forgotten today, very quickly in internet terms, although surely there are still plenty of websites out there written in it and not yet updated. Perhaps the bigger influence of XHTML is that all we Millennials are running around closing all of our elements explicitly, which is considerably ironic in a world where we are omitting whitespace from our JavaScript to save bytes. In fact, in a quick survey of HTML5 minifiers, most don't seem to remove unnecessary closing tags.

Of course, HTML parsers being what they are, it's guaranteed that there are parsers in use which will malfunction when presented with these completely standards-compliant documents! I love parsing.