John Resig - HTML 5 Parsing

HTML 5 Parsing

One of the biggest wins of the HTML 5 recommendation is a detailed specification outlining how parsing of HTML documents should work. For too many years browsers have simply tried to guess and copy what others were doing in hopes that their parser would work well enough to not cause too many problems with HTML markup found in the wild.

While some parts of HTML 5 are certainly more contentious than others, the parsing section is almost universally appreciated by browser vendors. Once browsers start to implement it, users will enjoy the improved compatibility as well.

One of the first implementations of the HTML 5 parsing rules was actually created to power the HTML 5 validator. (If you’re interested in testing it out, https://johnresig.com/ should validate as HTML 5.) This particular implementation is in Java, provides SAX and DOM interfaces for use, and is open source.

This is particularly interesting because Henri Sivonen (the author of the validator) just recently landed (Warning: Massive web page) a brand new HTML 5 parsing engine in Gecko, destined for the next version of Firefox.

What’s interesting about this particular implementation is that it’s actually an automated conversion of Henri’s Java HTML 5 parser to C++. This conversion happens automatically and changes will be pushed upstream to the Mozilla codebase.

Normally I would balk at the mention of a wholesale, programmatic conversion of a Java codebase to C++, but the results have been very surprising: a 3% boost in pageload performance.

And this is on top of the litany of bug fixes and compliance checks that this code base will be providing. You can examine some of the progress that went into constructing the patch in the Mozilla bug.

If you’re interested in giving the new parser a try (it’s doubtful that you’ll see many obvious changes – but any help in hunting down bugs would be appreciated) you can download a nightly of Firefox, open about:config, and set html5.enable to true.

If there were ever a time to start playing around with the jump to HTML 5, now would be it. Since HTML 5 is a superset of the features provided by HTML 4 and XHTML 1, it ends up being surprisingly easy to ‘upgrade’: just start by swapping out your current (X)HTML Doctype for the HTML 5 Doctype:

<!DOCTYPE html>
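
As a rough sketch of what that swap looks like in context, here is a minimal page using the new Doctype. The `lang` value, title, and paragraph text are placeholders made up for illustration; only the first line comes from the post itself. `<meta charset>` is the shorter encoding declaration that HTML 5 allows.

```html
<!DOCTYPE html>
<html lang="en">
<head>
  <!-- Short HTML 5 form of the character encoding declaration -->
  <meta charset="utf-8">
  <!-- Placeholder title, not from the original post -->
  <title>Example page</title>
</head>
<body>
  <!-- Placeholder content: existing markup would go here unchanged -->
  <p>Existing HTML 4 or XHTML 1 markup can stay as it is.</p>
</body>
</html>
```

Following the post's point, the Doctype is the only line that has to change to get started; the rest of an existing page can be left alone.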

From there you can check the site HTML 5 Doctor for additional details on how to get the new HTML 5 elements working in all browsers.

Posted: July 7th, 2009

13 Comments

Ryan McGrath (July 7, 2009 at 11:19 pm)

I have to say, I really like that every time a new version of Firefox releases, I’m looking forward to the next version because of some cool new feature. Never gets boring.

Being that it’s open source… any idea if Webkit will adopt it as well, or if they plan on rolling their own?
Robert O'Callahan (July 7, 2009 at 11:38 pm)

The coolest thing about the new HTML5 parser is the ability to use SVG and MathML inline in regular HTML.
orip (July 8, 2009 at 1:26 am)

How is the automatic Java -> C++ conversion performed? Sounds interesting.
Mathias Biilmann (July 8, 2009 at 4:23 am)

It might interest you that I used your JavaScript XML parser as a tokenizer for a limited implementation of the HTML5 parsing algorithm in JavaScript.

It’s far from complete and not intended to be. The code has been written following the specified algorithm very directly, with a few different modes, each with their own case statements for opening and closing tags.

This HTML5 parser is mainly useful for turning broken HTML into clean, well-formed XHTML when a user copies and pastes a Microsoft Word file into a designMode-based rich text editor.

http://github.com/biilmann/javascript-xhtml-purifier/tree/master

Regards,
Mathias
John Resig (July 8, 2009 at 6:51 am)

@Ryan McGrath: Not sure what WebKit will do here – it certainly would be pretty awesome if they decided to go off the same codebase (guaranteed inter-compatibility!).

@Robert O’Callahan: That is pretty cool – and something that Mozilla has been shooting for, for a while. Is there a list of bugs/features that this parser is resolving?

@orip: It looks like it’s a Java program specially tailored to the codebase.

@Mathias: Nice work – looks interesting!
Neal G (July 8, 2009 at 8:22 am)

How does IE6 & older browsers handle the HTML5 doctype? Does it throw them into quirks mode?
John Resig (July 8, 2009 at 8:37 am)

@Neal G: Good question! Did some digging and found an article by Henri (he’s everywhere!) detailing the doctype breakdown by browser: http://hsivonen.iki.fi/doctype/

It looks like the HTML 5 Doctype puts IE 6 into “Almost Standards Mode” (which, based upon the table, is the best that we can hope for).
Jostein Kjønigsen (July 8, 2009 at 9:06 am)

@Robert O’Callahan:

http://hsivonen.iki.fi/xhtml2-html5-q-and-a/

This link seems to disagree. Inline SVG and MathML are only allowed in XHTML5 with the correct MIME type set by the server (application/xhtml+xml).
voracity (July 8, 2009 at 9:27 am)

@Jostein: Actually, Henri in that link says:

You will be able to use SVG and MathML inline in text/html once browsers upgrade their parsers.

And that is what has just happened with the Firefox nightlies.
Adam Gordon (July 8, 2009 at 10:49 am)

I would argue that the biggest problem is, or soon will be, the molasses-like adoption of better-quality browser versions by larger companies.

Case in point:

Of the 89% of our partner’s customers that utilize our service, over 50%, 5.1M hits last month, still use IE 6. In fact, at my last job, I remember when IE 7 came out, we were told explicitly to NOT upgrade due to incompatibilities between our company’s web portal and that version.

I feel this excessively-cautious approach, at some point, is going to greatly hinder web development.
Jonathan Watt (July 8, 2009 at 11:12 am)

@Jostein: that statement is about “*released* versions of Firefox, Opera, Safari or Chrome” (emphasis mine), not the new HTML5 parser under discussion.
Ade (July 8, 2009 at 10:13 pm)

Adam Gordon said, “I feel this excessively-cautious approach, at some point, is going to greatly hinder web development.” There’s no question that it has already enormously hindered web development. My hope is that we’ll see more killer apps that rely on HTML5 features that simply don’t work whatsoever in any version of IE, and give people a more powerful motivation to switch.

Any time I have to do testing in IE I am absolutely taken aback that anyone would use that browser when every alternative is better.
Jonas Sicking (July 9, 2009 at 1:04 pm)

We definitely need help testing the new parser. The main concern we have regarding the HTML5 parser is that nobody knows how compatible the algorithm in the HTML5 spec is with the web.

We know it’s very good, but we don’t yet know if it’s good enough. The only way to find out is through people testing it.

So please please enable the HTML5 parser and report any compatibility problems. You don’t need to restart when changing the pref so it’s easy to see if it’s the HTML5 parser that is causing problems. Simply disable it and reload the page to see if that works.

So far only one problem has been reported:
https://bugzilla.mozilla.org/show_bug.cgi?id=502984

I’m sure there’s more out there :)
