May 5th, 2008
Recently I was having a little bit of fun and decided to go about writing a pure JavaScript HTML parser. Some might remember my one project, env.js, which ported the native browser JavaScript features to the server-side (powered by Rhino). One thing that was lacking from that project was an HTML parser (it parsed strict XML only).
I've been toying with the ability to port env.js to other platforms (Spidermonkey derivatives and the ECMAScript 4 Reference Implementation) and if I were to do so I would need an HTML parser. Because of this, the easiest route was to just write an HTML parser in pure JavaScript.
I did some digging to see what people had previously built, but the landscape was pretty bleak. The only one that I could find was one made by Erik Arvidsson - a simple SAX-style HTML parser. Considering that this contained only the most basic parsing - and none of the actual, complicated, HTML logic - there was still a lot of work left to be done.
(I also contemplated porting the HTML 5 parser, wholesale, but that seemed like a herculean effort.)
However, the result is one that I'm quite pleased with. It won't match the compliance of html5lib, nor the speed of a pure XML parser, but it's able to get the job done with little fuss - while still being highly portable.
htmlparser.js:
4 Libraries in One!
There were four pieces of functionality that I wanted to implement with this library:
A SAX-style API
Handles tag, text, and comments with callbacks. For example, let's say you wanted to implement a simple HTML to XML serialization scheme - you could do so using the following:
var results = "";
HTMLParser("<p id=test>hello <i>world", {
start: function( tag, attrs, unary ) {
results += "<" + tag;
for ( var i = 0; i < attrs.length; i++ )
results += " " + attrs[i].name + '="' + attrs[i].escaped + '"';
results += (unary ? "/" : "") + ">";
},
end: function( tag ) {
results += "</" + tag + ">";
},
chars: function( text ) {
results += text;
},
comment: function( text ) {
results += "<!--" + text + "-->";
}
});
results == '<p id="test">hello <i>world</i></p>'
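The same four callbacks can drive other transformations, too. As a minimal sketch, here's a handler object that keeps only the text content and drops tags and comments. The makeTextCollector name is hypothetical (it is not part of htmlparser.js), and since the library isn't loaded in this snippet the callbacks are invoked by hand, in the order the parser would be expected to fire them for "<p id=test>hello <i>world".

```javascript
// Hypothetical helper (not part of htmlparser.js): builds a handler object
// with the same callback names used above (start, end, chars, comment).
function makeTextCollector() {
  var parts = [];
  return {
    start: function( tag, attrs, unary ) {},   // tags are ignored
    end: function( tag ) {},
    chars: function( text ) { parts.push( text ); },
    comment: function( text ) {},              // comments are dropped
    text: function() { return parts.join( "" ); }
  };
}

// Simulate, by hand, the callback sequence for "<p id=test>hello <i>world".
var collector = makeTextCollector();
collector.start( "p", [ { name: "id", value: "test", escaped: "test" } ], false );
collector.chars( "hello " );
collector.start( "i", [], false );
collector.chars( "world" );
collector.end( "i" );
collector.end( "p" );

var plainText = collector.text();
```

With the real library loaded, the same object could presumably be passed as the second argument, HTMLParser(html, collector), and the result read back with collector.text().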
XML Serializer
Now, there's no need to worry about implementing the above, since it's included directly in the library, as well. Just feed in HTML and it spits back an XML string.
var results = HTMLtoXML("<p>Data: <input disabled>")
results == '<p>Data: <input disabled="disabled"/></p>'
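The disabled="disabled" expansion is the sort of normalization a serializer needs for attributes that are written without a value. Here's a rough sketch of the idea. The BOOLEAN_ATTRS list and the serializeAttr function are illustrative assumptions, not the library's actual internals, and the value is assumed to be raw text (not already entity-encoded).

```javascript
// Assumed, partial list of attributes that may appear without a value.
var BOOLEAN_ATTRS = [ "checked", "disabled", "readonly", "selected", "multiple" ];

// Hypothetical helper: serialize one attribute as well-formed XML.
function serializeAttr( name, value ) {
  // A value-less boolean attribute repeats its own name as the value;
  // any other value-less attribute gets an empty string.
  if ( value === undefined || value === null )
    value = BOOLEAN_ATTRS.indexOf( name ) !== -1 ? name : "";

  // Escape characters that can't appear raw in a double-quoted XML attribute.
  var escaped = String( value )
    .replace( /&/g, "&amp;" )
    .replace( /</g, "&lt;" )
    .replace( /"/g, "&quot;" );

  return name + '="' + escaped + '"';
}

var disabledAttr = serializeAttr( "disabled" );
var srcAttr = serializeAttr( "src", "test.jpg" );
var titleAttr = serializeAttr( "title", 'say "hi"' );
```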
DOM Builder
If you're using the HTML parser to inject into an existing DOM document (or within an existing DOM element) then htmlparser.js provides a simple method for handling that:
// The following is appended into the document body
HTMLtoDOM("<p>Hello <b>World", document)
// The following is appended into the specified element
HTMLtoDOM("<p>Hello <b>World", document.getElementById("test"))
DOM Document Creator
This is a more-advanced version of the DOM builder - it includes logic for handling the overall structure of a web page, returning a new DOM document.
A couple points are enforced by this method:
- There will always be a html, head, body, and title element.
- There will only be one html, head, body, and title element (if the user specifies more, they will be moved to the appropriate locations and merged).
- link and base elements are forced into the head.
You would use the method like so:
var dom = HTMLtoDOM("<p>Data: <input disabled>");
dom.getElementsByTagName("body").length == 1
dom.getElementsByTagName("p").length == 1
While this library doesn't cover the full gamut of possible weirdness that HTML provides, it does handle a lot of the most obvious stuff. All of the following are accounted for:
- Unclosed Tags:
HTMLtoXML("<p><b>Hello") == '<p><b>Hello</b></p>'
- Empty Elements:
HTMLtoXML("<img src=test.jpg>") == '<img src="test.jpg"/>'
- Block vs. Inline Elements:
HTMLtoXML("<b>Hello <p>John") == '<b>Hello </b><p>John</p>'
- Self-closing Elements:
HTMLtoXML("<p>Hello<p>World") == '<p>Hello</p><p>World</p>'
- Attributes Without Values:
HTMLtoXML("<input disabled>") == '<input disabled="disabled"/>'
Note: It does not take into account where in the document an element should exist. Right now you can put block elements in a head or th inside a p and it'll happily accept them. It's not entirely clear how the logic should work for those, but it's something that I'm open to exploring.
You can test a lot of this out in the live demo.
While I doubt this will cover all weird HTML cases - it should handle most of the obvious ones - at least making HTML parsing in JavaScript feasible.
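To make the recovery rules listed earlier a bit more concrete (unclosed tags, empty elements, block vs. inline, self-closing elements), here's a minimal stack-based sketch. It is not the htmlparser.js source: the repairTags name and the abbreviated element lists are assumptions made for illustration, and it only understands tags without attributes.

```javascript
// Assumed, heavily abbreviated element categories (illustrative only).
var EMPTY = [ "br", "hr", "img", "input" ];
var BLOCK = [ "p", "div", "ul", "table" ];
var INLINE = [ "b", "i", "em", "span", "a" ];
var CLOSE_SELF = [ "p", "li", "td", "tr" ];

function isIn( list, tag ) { return list.indexOf( tag ) !== -1; }

// Hypothetical sketch: repair attribute-less HTML into well-formed XML.
function repairTags( html ) {
  var out = "", stack = [], token, re = /<(\/?)(\w+)>|([^<]+)/g;

  function last() { return stack[ stack.length - 1 ]; }

  // Pop and close open elements until only `pos` of them remain open.
  function closeTo( pos ) {
    while ( stack.length > pos )
      out += "</" + stack.pop() + ">";
  }

  while ( (token = re.exec( html )) ) {
    var text = token[3], tag = token[2] && token[2].toLowerCase();

    if ( text !== undefined ) {
      out += text;
    } else if ( token[1] === "/" ) {
      // End tag: close back to the nearest matching open tag, ignore strays.
      var pos = stack.lastIndexOf( tag );
      if ( pos >= 0 ) closeTo( pos );
    } else {
      // Block vs. inline: a block element closes any open inline elements.
      if ( isIn( BLOCK, tag ) )
        while ( stack.length && isIn( INLINE, last() ) )
          closeTo( stack.length - 1 );

      // Self-closing: a <p> directly inside an open <p> closes the first one.
      if ( isIn( CLOSE_SELF, tag ) && last() === tag )
        closeTo( stack.length - 1 );

      if ( isIn( EMPTY, tag ) ) {
        out += "<" + tag + "/>";      // empty elements never go on the stack
      } else {
        out += "<" + tag + ">";
        stack.push( tag );
      }
    }
  }

  closeTo( 0 );                       // unclosed tags: close whatever is left
  return out;
}

var unclosed = repairTags( "<p><b>Hello" );
var blockInline = repairTags( "<b>Hello <p>John" );
var selfClosing = repairTags( "<p>Hello<p>World" );
```

The stack is what makes each rule cheap to apply: every decision only looks at the most recently opened element.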
Tags: rhino, html, javascript, parsing
30 Comments on 'Pure JavaScript HTML Parser'
July 9th, 2007
This weekend I took a big step in upping the ante for JavaScript as a Language. At some point last Friday evening I started coding and didn't stop until sometime mid-Monday. The result is a good-enough browser/DOM environment, written in JavaScript, that runs on top of Rhino; capable of running jQuery, Prototype, and MochiKit (at the very least).
The implications of this are phenomenal, and I'm not the only one who's interested in what this could mean for server-side JS development. More on that in a minute, but first here's some sample results from running jQuery:
jQuery
$ java -jar build/js.jar
Rhino 1.6 release 6 2007 06 28
js> load('build/runtest/env.js');
js> window.location = 'test/index.html';
test/index.html
js> load('dist/jquery.js');
// Add pretty printing to jQuery objects:
js> jQuery.fn.toString = DOMNodeList.prototype.toString;
js> $('span').remove();
[ <span#å°åŒ—Taibei>, <span#å°åŒ—>, <span#utf8class1>,
<span#utf8class2>, <span#foo:bar>, <span#test.foo[5]bar> ]
// Yes - UTF-8 is supported in DOM documents!
js> $('span')
[ ]
js> $('div').append('<span><b>hello!</b> world</span>');
[ <div#main>, <div#foo> ]
js> $('span')
[ <span>, <span> ]
js> $('span').text()
hello! worldhello! world
On a whim, I then plugged in Prototype and MochiKit, both of which appeared to work OK (I haven't done any significant testing with them - so there's probably gaps). Here's some sample results:
Prototype
$ java -jar build/js.jar
Rhino 1.6 release 6 2007 06 28
js> load('build/runtest/env.js');
js> window.location = 'test/index.html';
test/index.html
js> load('prototype.js');
js> $$('div p')
<p#firstp>,<p#ap>,<p#sndp>,<p#en>,<p#sap>,<p#first>
js> Object.toJSON({foo:'bar',baz:true});
{'baz': true, 'foo': 'bar'}
js> var fn = (function(name,msg){
print(name + ' ' + msg); }).curry('John');
js> fn('hello!');
John hello!
MochiKit
$ java -jar build/js.jar
Rhino 1.6 release 6 2007 06 28
js> load('build/runtest/env.js');
js> window.location = 'test/index.html';
test/index.html
js> load('Mochikit.js');
js> $$('div')
<div#main>,<div#foo>
js> document.body.innerHTML = '';
js> document.body.appendChild( P( 'test',
A({href:'http://google.com/'}, 'link')) );
js> document.body.innerHTML
<p>test<a href='http://google.com/'>link</a></p>
js> $$('a')
<a>
I just want to emphasize that these are un-modified copies of jQuery, Prototype, and MochiKit - all running perfectly in this un-natural environment.
When I came up with this idea for an environment, I was mulling over a couple ideas: Namely, better ways of automating tests and ways to bring JS-style DOM/HTML interaction to the server-side. Having a way to bring this popular idiom to established problem sets seemed like a lot of fun.
In short, the following (at the very least) can all get a big dose of JavaScript:
- Automated Testing
- Screen Scraping
- Web Application Development
Now, if you think I'm crazy, I'd like to show you a couple quick examples:
Automated Testing
$ java -jar build/js.jar
Rhino 1.6 release 6 2007 06 28
js> load('build/runtest/env.js');
js> window.location = 'test/index.html';
test/index.html
js> load('dist/jquery.js');
js> load('build/runtest/testrunner.js');
js> load('src/jquery/coreTest.js');
PASS (1) [core] Array.push()
PASS (2) [core] Function.apply()
PASS (3) [core] getElementById
PASS (4) [core] getElementsByTagName
PASS (5) [core] RegExp
PASS (6) [core] jQuery
...
Oh yes, that's right - the full jQuery test suite is now automated and capable of running in Rhino (passing all tests). jQuery served as my initial testbed for development, making sure that I was getting all of my code right. So if you import a copy of jQuery into this environment, it should work "just fine".
By the way, you can try out the automated test suite by getting a copy of trunk/jquery out of SVN, then running make runtest - the results are just awesome.
Screen Scraping
This is one part that works pretty well right now - with the huge caveat that it only works on well-formed XML documents (oops!). I'll be integrating an HTML parser into the code base so that we can make this functionality a little more resilient. In the meantime, here's an example of the sort of scraping that you can do currently:
load("env.js");
window.location = "http://alistapart.com/";
window.onload = function(){
load("dist/jquery.js");
print("Newest A List Apart Posts:");
$("h4.title").each(function(){
print(" - " + this.textContent);
});
};
And here's another one that writes the results out to a file:
load("env.js");
window.location = "http://alistapart.com/";
window.onload = function(){
load("dist/jquery.js");
var str = "Newest A List Apart Posts:\n";
$("h4.title").each(function(){
str += " - " + this.textContent + "\n";
});
var out = new XMLHttpRequest();
out.open("PUT", "file:/tmp/alist.txt");
out.send( str );
};
Oh yeah, I went there - I made PUT and DELETE requests to local files perform the expected actions. I think the result is hilarious.
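The idea behind the file: handling is just a dispatch on the request method. The sketch below is an assumption about how such a mapping could look, not the env.js code itself (which runs on Rhino). It uses Node.js's fs module as a stand-in for the file I/O, and fileRequest is a made-up name.

```javascript
var fs = require( "fs" );
var os = require( "os" );
var path = require( "path" );

// Hypothetical helper: treat HTTP verbs as operations on a local file path.
function fileRequest( method, filePath, body ) {
  switch ( method.toUpperCase() ) {
    case "GET":                       // read the file's contents
      return fs.readFileSync( filePath, "utf8" );
    case "PUT":                       // create or overwrite the file
      fs.writeFileSync( filePath, body == null ? "" : String( body ), "utf8" );
      return "";
    case "DELETE":                    // remove the file
      fs.unlinkSync( filePath );
      return "";
    default:
      throw new Error( "Unsupported method for file requests: " + method );
  }
}

// Usage: write a scratch file, read it back, then remove it again.
var scratch = path.join( os.tmpdir(), "alist-sketch-" + process.pid + ".txt" );
fileRequest( "PUT", scratch, "Newest A List Apart Posts:\n" );
var readBack = fileRequest( "GET", scratch );
fileRequest( "DELETE", scratch );
var stillThere = fs.existsSync( scratch );
```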
Web Application Development
This is still a work in progress, but some of the initial ideas are already at play here in this environment. When I have some time I plan on making a JavaScript-based web app framework out of this - which should be pretty cool.
Here's some pseudo-code for how I think it could work:
window.onload = function(){
print("Content-type: text/html\n");
if ( location.href == "/" )
show_home();
print( document.innerHTML );
};
function show_home(){
document.load("index.html");
document.getElementById("time").innerHTML = (new Date()).toString();
}
Download!
Check out the code - there's still huuuge gaps of functionality missing - I only implemented the bare minimum to get this environment working (and passing the jQuery test suite). So your mileage may vary.
Download: http://jqueryjs.googlecode.com/svn/trunk/jquery/build/runtest/env.js (Formatted)
How to Use
To start with, you'll need to have, at least, Rhino 1.6R6. You can download it from Mozilla FTP.
Now download the env.js script and put it in the same directory as the Rhino js.jar.
In order to use it from the command-line, you'll wanna do something like this:
$ java -jar js.jar
js> load('env.js');
js> window.location = 'some.html';
some.html
js> // Your code here!
It's important that you do window.location = "some file" before loading any DOM-dependent code (as the 'document' object doesn't exist before the location request).
A full list of Rhino-shell-specific commands can be found in the Rhino Shell docs.
If you want to write executable scripts, the contents will look something like this:
load('env.js');
window.location = 'some.html';
window.onload = function(){
// Your code here
};
Which can then be run like so: java -jar js.jar myscript.js.
Feedback is very much welcome - I've only thought of a couple use-cases thus far, but I'm sure that the surface is just being scratched.
Tags: rhino, java, ecmascript, firefox, mozilla, javascript
61 Comments on 'Bringing the Browser to the Server'
July 3rd, 2007
For my work at Mozilla, I'm gearing up to talk more about JavaScript 2.0. This involves a lot of things (from reading up on the specification, looking at non-web-based uses of JavaScript, to teaching myself SML). Perhaps most challenging, however, is the struggle that I've been facing to quantify and understand the shifts being made in the language - and how that relates to JavaScript programming in general.
I think we've seen the JavaScript language move through many individual phases:
- The "We need scripting for web pages" phase. (Netscape)
- The "We should standardize this" phase. (ECMAScript)
- The "JavaScript isn't a toy" phase. (Ajax)
- The "JavaScript as a programming language" phase.
I'm surmising that there's this new phase that we're starting to enter, one where JavaScript will be treated as a significant programming language - divorced from the concept of web development. Two significant movements lead me to believe that we're at the start of a new era for JavaScript.
JavaScript Speed
A good deal of energy has been put into worrying about JavaScript performance. This is a great sign. It's sort of a natural progression for a language (worry about implementation, then standardization and compliance, and finally speed).
For proof, look at the work that's being done by the different browser vendors:
- Mozilla is working on Tamarin (JIT JavaScript)
- Apple is working on Webkit/Safari 3 (Revamped JS Engine)
- Opera is releasing a new JS Engine in Opera 9.5 (New features and speed improvements)
- Microsoft is working on Internet Explorer 8.0 (A bunch of new JS work)
Non-Web-based Use
I've been reading a lot about the use of JavaScript in non-"traditional" situations; especially in relation to the use of Rhino (the JavaScript implementation that sits on top of Java and the JVM).
Specifically, two projects have really stood out as having a lot of potential.
JavaScript on Rails - Granted, at this point, this project may as well be pure vaporware, but it's caught the attention of the right people. When one of the most popular software bloggers talks about how there's a "next big language" coming up and then announces his massive re-write of the popular Ruby on Rails framework, in JavaScript, running on Rhino - people tend to pay attention.
Helma - This web application framework is a long-standing stalwart of server-side development with JavaScript (again, using Rhino). Surprisingly, it's managed to fall through the cracks with just about every JavaScript developer that I know. I recently noticed it, and after some startup friends of mine revealed that they're developing an application based on it, I became convinced that we'll be hearing about this little framework in the upcoming months.
All of this leads me up to a point: JavaScript is actively advancing, as a language. While its most popular domain will probably always be in web browsers (with new JavaScript engines pointing in that continued direction), the advancement of server-side uses of JavaScript will only make for a much larger area for possible development in the upcoming years.
This is all a convoluted way of saying that this is the perfect opportunity to introduce some much needed changes into the language - completing the extended transition of JavaScript from a toy to a professional development tool.
Tags: mozilla, ecmascript, programming, javascript, rhino
54 Comments on 'JavaScript as a Language'