Blog


JavaScript Benchmark Quality

Summary: JavaScript Benchmarks aren't adapting well to the rapid increase in JavaScript engine performance. I provide some simple tests for verifying this and propose a modified setup which could be used by all JavaScript Benchmarks to achieve high-quality results.

There now exists three, what I would consider to be, major JavaScript performance benchmarks. Each are released by a major browser vendor. WebKit released SunSpider, Mozilla released Dromaeo, Google released the V8 Benchmark.

Each suite has a variety of tests and a test runner. I'm currently interested in one thing: The quality of the test runner that each of these suites provides.

There are three points that any test runner tries to achieve:

  1. Retrieving accurate numbers. Another way to phrase it: "Retrieving stable numbers." If you run the test suite multiple times will you get identical, or near-identical, numbers?
  2. Reducing the possible error. Does the suite attempt to quantify how much possible error there could be in their results? How large is the error?
  3. Reducing run time. How long does it take to run each test?

The ideal suite would be one that's capable of running an individual test as quickly and accurately as possible with virtually no error. However, in order to get those numbers you need to carefully chose what style of tests you wish to run.

I quantify the current field of tests into three categories:

Slow-running tests: These tests generally take a couple hundred milliseconds on an average consumer machine. This is the style of tests that you generally see in SunSpider and occasionally in Dromaeo. These tests have a distinct advantage: They are generally quite accurate. You don't need to run them very many times (say about 5) in order to get a consistent picture of how the test will run.

Moderate-running tests: These tests take less than one hundred milliseconds. You'll see these tests in Dromaeo. These tests need to be run quite a few times (20-30) in order to get a better picture of their speed.

Fast-running tests: These tests take less than ten milliseconds. You'll see these tests in the V8 Benchmark and in Dromaeo. These tests must be run many, many, times (usually close to a thousand) in order to weed out any possible error. If you were to run one 10ms test 5 times - and get a single result of 11ms that would introduce a significant level of error into your results. Consider tests that take 0-1ms. A deviation within that range can instantly cause error levels around 50% to occur, if enough test iterations aren't completed.

Looking at the above categories the solution seems obvious: Use slow-running tests! You get to run them fewer times and you get accurate results - everything's peachy! But here's the problem: The speed of JavaScript engines are increasing at a rate faster than test suites can adapt. For example, SunSpider was initially developed with all tests running at equal speeds on a modern computer. Now that speed improvements have come along, though, (WebKit improvements, then Mozilla, then SquirrelFish, then TraceMonkey, then V8) those results don't even remotely resemble the tests of old. Most of the results have moved down into the moderate-running range of tests, some even into the fast-running range - but here's the problem: They're still only running the originally-designed number of loops. An example of the difference:

This means that a browser is running a test for 5-10 loops (in both SunSpider or Dromaeo) but the speed of the test no longer matches that assigned number of iterations. At first glance you could say one of two things: 1) Increase the difficulty of the tests or 2) Have them run for more iterations. There are problems with this, though.

While you certainly could increase the complexity of existing tests the result would be a never-ending battle. Improvements would have to land every couple months in order to keep up with the pace of improvement. This would work ok in Dromaeo (since it has versioned tests) but not all suites can handle this behavior. Additionally this now makes the tests less-useful for measuring slower browsers (the tests now take an absurd amount of time to complete as opposed to the very-plausible numbers from before).

Additionally, you could increase the number of test iterations that would occur but not without assigning specific iteration counts to each individual tests. And this is the full problem: How do you know what numbers to choose? Raising the number to 20 iterations may help one browser - but what about another browser which will need 100 iterations to get a proper count?

This leaves us in a bind. Browsers keep getting faster at tests, test suites do the wrong number of iterations, causing the error level to continually increase:

We should take a step back and look at what the test suites are doing to counter-act the above trend from happening - if anything at all.

SunSpider was originally dominated by long-running tests running 5 times each. The tests use to be long-running but are now only in to the medium to fast-running range (depending on the browser). This has caused the accuracy to decrease and error level to increase. Increasing the number of iterations would help (but hinder older browser performance).

Dromaeo has a range of tests (fast, moderate, and long-running) each running 5-10 times each. Dromaeo attempts to correct the number of iterations run, right now, but kind of fails when doing so. It looks at the results of past iterations (especially the error level generated by the results) and decides to run more tests until a stable error level is achieved. The problem with this is the samples are no longer being independently determined. Whereas test runs 1-5 were independent, test runs 6-10 were not (they're only being run due to the fact that previous test runs provided poor results). So while the results from Dromaeo are hyper-stable (they're the most stable performance test that we run at Mozilla) they're not determined in a proper statistical manner. Thus Dromaeo needs to be changed in order for people to be able to gather accurate results without sacrificing its statistical integrity.

The V8 Benchmark takes a completely different strategy for its fast-running tests: Instead of running for a pre-determined number of iterations each test is run continuously until a second of time has passed. This means that individual tests frequently run anywhere from 50-200 times (depending on the machine and browser they run on). Currently the V8 Benchmark does suffer from one shortcoming: There is no error calculation done. Both SunSpider and Dromaeo fit the results to a t-distribution and compute the possible error of the results whereas the V8 Benchmark just leaves them as is.

However, the V8 Benchmark does bring up a very interesting strategy. By picking tests that are simpler (and, arguably, most current complex tests will become "simple" as engines rapidly improve) and running them more times (relative to the complexity of the test) the results become much more stable.

Consider the result: Fast-running tests end up having a smaller error range because they are able to run more within a given allotment of time. This means that the test runner is now self-correcting (adjusting the number of iterations seamlessly). Since all JavaScript engines are getting faster and faster, and complex tests are running in shorter and shorter amounts of time, the only logical conclusion is to treat all tests as if they were fast tests and to run them a variable number of times.

We can test this hypothesis. I've pulled together a simple demo that tests the computational accuracy of the three styles of test (Simple - or Fast-running, Moderate, and Complex - or Slow-running) against the two different types of suites: "Max" style (the style implemented by the V8 Benchmark) and "Fixed" style (the style implemented by SunSpider and Dromaeo). The results are quite interesting:

Fast and moderate-speed tests are incredibly accurate with the max-style of test running. Often their error is no more than 2% (which is really quite acceptable). The quality degrades for the complex tests, but that's ok. Complex tests could be tweaked to consume less resources (thus allowing them to iterate more times and become more accurate).

Right now the accuracy of slow-to-moderate tests in fixed run test suites are suffering from increased rates of error as the engines get faster (as can be seen in the above chart).

The major change that would need to be made to the flexible style of execution (at least the one implemented in the V8 Benchmark) would be some form of error checking and t-distribution fitting. In V8 a series of tests are run over the course of one second. I propose that the resulting number (the total number of tests executed) be saved and the tests are repeatedly run again (perhaps 5 times, or so). The resulting set of 5 numbers (representing, potentially, thousands of test runs) would then be analyzed for quality and error level. I have this measure completely mocked up in my sample test measurement.

The relevant portion of the code is here:

function runTest(name, test, next){
  var runs = [], r = 0;

  setTimeout(function(){
    var start = (new Date).getTime(), diff = 0;

    for ( var n = 0; diff < 1000; n++ ) {
      test();
      diff = (new Date).getTime() - start;
    }

    runs.push( n );

    if ( r++ < 4 )
      setTimeout( arguments.callee, 0 );
    else {
      done(name, runs);
      if ( next )
        setTimeout( next, 0 );
    }
  }, 0);
}

Switching to a flexible style of text execution has many benefits:

  • It'll help to keep the rate of error small for most tests (especially if the tests are tailored to be smaller in nature).
  • It'll allow the suite to gracefully adapt over time, as engines speed up, without sacrificing the ability to work on old engines.
  • The maximum number of runs (and, thus, the maximum amount of accuracy) will occur for the engines that are able to complete the tests faster. Since the greatest amount of competition is occurring in these high numbers (as opposed to in older user agents, where there is no progress being made) granting them the most accurate numbers possible makes this ideal.

I plan on prototyping this setup into Dromaeo and releasing it shorting. It'll take much longer to run the full test suite but the result will be quite worth it.

Tags: webkit, tracemonkey, v8, javascript, mozilla

TraceMonkey

I've been waiting to blog about this for a long time now. A fantastic new improvement to Mozilla's JavaScript engine (SpiderMonkey) has landed. Code-named TraceMonkey this engine utilizes a techniques, called trace trees (PDF), which adds just-in-time native code compilation to SpiderMonkey.

A major goal of the project has been to set JavaScript up to compete with natively-compiled code, rather than simply against other interpreters. This means that we're starting to see speeds that are completely out of this league when it comes to performance.

Results and Try it Yourself

Here are the results from four benchmarks to give you a taste:


(Graph courtesy of Brendan Eich.)

The tests are:

If you want to try these out for yourself, just snag a nightly of Firefox 3.1, open about:config, and set the following preference to true:

javascript.options.jit.content

You should be, happily, in just-in-time tracing land. It's still buggy (hence the reason for hiding behind the preference wall) but it should be good enough to handle most web sites.

What's especially exciting is that this code is working on x86, x86-64, and ARM - which means that these improvements won't be limited to just the desktop - you'll be able to receive them on your mobile devices as well.

How Tracing Works

In simple terms tracing works by watching for commonly-repeated actions (such as loops, function calls, or type checking) and tries to optimize their resulting native code into the lowest number of actions. The premise is rather simple - and it's an advance that we'll probably see proliferate to many interpreters and engines in the upcoming years.

Andreas Gal published a paper (PDF) on the subject and Brendan Eich has written up a TraceMonkey-specific explanation.

Some of the improvements made by tracing include:

  • Function Inlining: Removing the overhead of function calls by simply replacing them with their resulting native code.
  • Type Inference: Removing checks surrounding common operators (like "+") when the types contained within a variable are already known. This means that the engine will have already pre-determined, for example, that two strings need to be concated when it sees the "+" operator.
  • Looping: The overhead of looping has been grossly diminished. It's one of the most common areas of overhead in JavaScript applications (common repetition of a task) and the constant determining of bounds and the resulting inner code is made negligible.

The code for this project has come from a number of places - one of which is coming from some code contributed to Mozilla, from Adobe: Tamarin Tracing, specifically the nanojit code that's able to work a lot of this just-in-time magic.

Development

The work began just about 60 days ago, working with Andreas Gal of UC Irvine, to integrate the nanojit technology into Spidermonkey. You can hear more about the development from those that were involved: Andreas Gal, Mike Shaver, and Brendan Eich.

The full code can be found in the TraceMonkey mercurial repository (the commit to merge TraceMonkey into Mozilla core is massive, clocking in at about 4MB).

If you want to try running your own copy of TraceMonkey on the command-line, just follow the steps outlined here.

There is still a ton of work to be done. The incredible speed-ups that we're seeing are only just the beginning. A lot can be done to improve how registers are currently being allocated which will provide even more speed-ups.

Right now there isn't any tracing being done into DOM methods (only across pure-JavaScript objects) - but that is something that will be rectified. Being able to trace through a DOM method would successfully speed up, not only, math and object-intensive applications (as it does now) but also regular DOM manipulation and property access.

Ramification

So what does this all mean? It means that JavaScript is no longer confined by the previously-challenging resource of processing power. With this improvement it's leap-frogged any sort of traditional and has gone head-to-head with computationally-powerful languages like C.

I fully expect to see more, massive, projects being written in JavaScript. Projects that expect the performance gains that we're starting to see. Applications that are number-heavy (like image manipulation) or object-heavy (like relational object structures).

One area that I'm especially excited about is in relation to Canvas. The primary thing holding back most extensive Canvas development hasn't been rendering - but the processor limitations of the language (performing the challenging mathematical operations related to vectors, matrices, or collision detection). I expect this area to absolutely explode after the release of Firefox 3.1 as we start to see this work take hold.

Seeing releases like this are absolutely exciting for me. JavaScript is absolutely the little-language-that-could - continually routing around any of its short-comings and blowing away all of its expectations. I look forward to using it for many, many, years to come.

Tags: tracemonkey, mozilla, firefox, javascript

JavaScript Books

Secrets of the JavaScript Ninja

JavaScript Secrets

Secret techniques of top JavaScript programmers.

Pro JavaScript Techniques

Pro JavaScript

The best techniques for professional JavaScript. Published by Apress.

Micro Updates

John Resig Twitter Updates

@jeresig

Infrequent, short, updates and links.

JavaScript Jobs



Hosting provided by: Ruby Hosting by Engine Yard