Matthew Paul Thomas

When semantic coders go bad

Lachlan Cannon, Matt May, and James Craig disagree with my statement that HTML’s b and i elements should be used to achieve bold and italic appearance for meanings that don’t have an equivalent semantic HTML element.

Lachlan Cannon’s explanation of how b and i should be avoided is immediately followed by a form for posting comments. The form’s instructions say (and I am not making this up): “To emphasise text surround it with [i] and [/i]. To strongly emphasise text, surround it with [b] and [/b].”

James Craig’s comment system doesn’t even offer a warning. When I posted a response to his plea “Don’t fall back on b and i!”, all the semantic markup I had used in my comment was automatically removed. The only markup allowed to remain was … b and i.

And there’s more irony where that came from

Humans are visual creatures, and when writing a document, most of us think about the words on the page — or on the screen — rather than about the underlying semantics. Most of us can’t even be bothered setting up visual styles in Microsoft Word. So expecting everyone to think about whether they’re using italics to represent emphasized text, or a citation, or a defined term, or a mathematical variable, or something else, is quite unrealistic. Thinking about semantics requires an unusual kind of intelligence, and an unusual degree of obsession.

The genius of HTML, so far, is that it has allowed people to create documents that are reasonably semantic without having to think about it. When someone asks their authoring software for bulleted items, for example, they get underlying code saying “this is an unordered list”, and Google Sets gets more data without any extra effort on their part. But when people want a particular presentation (such as bold or italics) that has more than one meaning, and there are semantic elements for more than one of those meanings, many people will choose the wrong element. It’s inevitable. It’s better to let these people default to semantically meaningless markup (like b and i), instead of using the wrong semantic element and fouling up software that distinguishes between such elements.

Matt May’s complaint that this is not the right way for a semantic HTML coder to look at things is entirely correct, but almost entirely irrelevant. For HTML to be “the lingua franca for publishing hypertext on the World Wide Web”, it must also cater for the ~99 percent of Web authors who don’t care about semantics and never will.

That last quote, by the way, is from the W3C’s page on HTML and XHTML. I had to correct the W3C’s markup, because they abused em to italicize the phrase lingua franca. And that’s not an isolated mistake: elsewhere on that page strong is abused to mean samp and to mean dt, and there are dozens of missing abbr tags. If even the World Wide Web Consortium can’t get semantic markup right, it is unreasonable to expect everyone else to do so.

An unrequited love for `span`

So how should you mark up text in another language? Here’s what I used in my original example:

<i lang="fr">je ne sais quoi</i>

Instead, Matt May suggests

<span lang="fr">je ne sais quoi</span>

and, in his style sheet, italicizes any span with a lang attribute using a simple attribute selector.

These two options have exactly the same semantics. The only difference is that Mr May’s version doesn’t work in Internet Explorer for Windows. Or in Internet Explorer for Mac. Or in Opera in User Mode. Or in Mozilla with Basic Page Style selected. Or in Firefox with Basic Theme selected. Or in Lynx.
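To make the comparison concrete, here is a minimal sketch of both options side by side. Mr May’s actual style sheet is not reproduced here, so the rule below is an assumption about what “a simple attribute selector” would look like, and the surrounding sentence is invented for illustration.

```html
<style>
  /* Assumed rule: italicize any span that carries a lang attribute */
  span[lang] { font-style: italic; }
</style>

<!-- Option 1: rendered in italics even when no style sheet is applied -->
<p>She has a certain <i lang="fr">je ne sais quoi</i>.</p>

<!-- Option 2: rendered in italics only where the attribute selector above is applied -->
<p>She has a certain <span lang="fr">je ne sais quoi</span>.</p>
```

Both elements carry the same lang attribute, so software reading the language gets identical information from either one. The difference shows up only in what a reader sees when the selector is unsupported or author styles are switched off.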

But that’s okay, according to Lachlan Cannon, because it means Internet Explorer for Windows, Internet Explorer for Mac, Opera, Mozilla, Firefox, and Lynx are all “old, and broken browsers, and surely Matthew isn’t going to suggest we should pander to broken browsers?” Right on, brother! The latest development version of Lynx was released two whole days ago. Everyone using a browser that ancient should, apparently, get off the Web and leave it for the rest of us.

For other bold and italic items that don’t have their own semantic elements, Lachlan, Matt, and James all recommend a span with a class attribute. Again, this is just as semantically meaningless as my b and i with a class attribute. But wait, they say, you can style the class in a span! Yes, but you can style the class in a b or i just as easily. (You could even neutralize the bold or italics, if your extra styling made them superfluous.) Again, the only difference in using span is that it works for fewer readers.
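A short sketch of that point, using ship names as a stand-in for an italicized meaning with no semantic element of its own. The class names and the example sentences are hypothetical, chosen only for illustration.

```html
<style>
  /* Hypothetical class: extra styling hangs off the class on an i element,
     exactly as it would on a span */
  i.ship { color: navy; }

  /* Hypothetical class: neutralize the default italics where other
     styling makes them superfluous */
  i.ship-caps { font-style: normal; font-variant: small-caps; }
</style>

<!-- Italic by default in any browser, navy where the style sheet is applied -->
<p>They sailed on the <i class="ship">Beagle</i>.</p>

<!-- Small caps where the style sheet is applied, plain italics everywhere else -->
<p>They sailed on the <i class="ship-caps">Beagle</i>.</p>
```

Replacing each i with a span would leave the styled result unchanged, but the unstyled fallback would be indistinguishable from the surrounding text.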

The myth of deprecation

If these people don’t care about their pages working in real-world browsers, what do they care about? Sometimes they claim to care about forward compatibility, based on false impressions about what is in the W3C specifications. For example, James Craig says, “Presentational markup like b and i have been deprecated.” That statement is false.

In HTML 4.01, these are all the deprecated elements: applet, basefont, center, dir, font, isindex, menu, s, strike, and u. That’s all. b and i are not deprecated.

In XHTML 1.0, these elements are newly deprecated: none at all. b and i are not deprecated — they’re even in the Strict version. The only thing newly deprecated is the name attribute.

In XHTML 1.1, these elements are newly deprecated: none at all. b and i are not deprecated — they’re in the presentation module. What XHTML 1.1 does do is remove the name attribute and … ooh look, it removes the lang attribute! The very attribute Matt May suggested we use with span! So much for that forward-compatibility schtick, eh?

Ah, but what about XHTML 2.0? Matt says: “The b and i elements convey no semantics. They are purely presentational elements. (Which is why they’re missing from XHTML 2.)”

Yes, in the current XHTML 2.0 draft, b and i are not deprecated — they’re dropped without explanation. But that can’t be, as Matt asserts, because they convey no semantics and are purely presentational. Because the current draft still includes the sub and sup elements, which also convey no semantics and are purely presentational. (The draft does suggest an alternate rendering for teletype-like devices; but Lynx already does an alternate rendering for i, and that doesn’t magically make i a semantic element.) Even in MathML, subscript and superscript are supported using presentational elements for situations where the semantic elements do not apply. Like bold and italics, subscript and superscript have many different meanings, and it would be impractical to specify them all in a markup language.

Since XHTML 2.0 is only a draft, subject to change, and is unlikely to be widely used before the end of the decade, if ever, it seems rather absurd to consider forward compatibility with XHTML 2.0 as more important than having current documents work in current Web browsers. If you’re really aiming for such forward compatibility, for example, you’d better not be using the img element. (Like Matt May does on every single page of his Weblog.)

And markup to go before I sleep

Markup is hard. You can see this in the number of people who forget to use titles. You can see it in the number of people who think the alt attribute is a tag. You can see it in the number of people who treat text-only readers as second-class citizens by using alt to describe images instead of replacing them. You can see it in the number of people who, having read that “tables for layout are bad”, misremember it as “tables are bad” and try to construct oddities like tableless calendars. And you can see it in the number of syntaxes created as simpler alternatives to HTML, like BBCode, Markdown, Textile, Wiki syntax, and reStructuredText.

Adding semantics to markup makes a well-written page more useful, which makes the Web more useful. But making it hard not to use those semantics makes bad semantics more likely, which makes the Web less useful. Ideally XHTML will find a balance so that the Web as a whole is as useful as possible. Unfortunately, since those working on the XHTML 2.0 specification know a lot about markup, they will likely underestimate how hard it is, and they will do things that make the Web as a whole less useful. Things like removing the b and i fallbacks for people who would otherwise use bad semantics. I hope I’m wrong about that.

2004-05-19: More from Joe Clark.

This entry was posted on Sunday, May 9th, 2004 at 4:45 am and is filed under Computing & Internet.
