It’s damn easy to sculpt tests to make yourself look good – or others look bad. I’d be a liar if I said that I didn’t knowingly write tests that emphasized our best selectors in jQuery. Of course, everyone does this – it’s a game of tactics. The issue, though, is that this sort of strange test writing behavior only arises when you’re competing on stats – you want your code to look the best that it can. It’s only when you divorce yourself from competition, and take a step back, that you can truly start to understand what it is that’s most important to users and what’s a true performance problem.
A couple examples:
In jQuery there’s the CSS 3 ~ selector. A stupid, stupid, worthless selector. No one uses it; I wish it would go away. I had actually removed this selector from jQuery since no one used it, and it wasn’t missed in the slightest. (I’ve removed a huge number of others as well – no one has ever complained about the missing :nth-last-of-type pseudo-selector.) However, once libraries start to compete (I think it was ExtJS that first released a speed test that compared us on ‘~’), users lose. They get bloat (an extra feature that no one used), they lose performance elsewhere (to the overhead of supporting the extra selector), and it takes developer time away from other work. In the end I re-implemented the ~ selector to meet this non-existent demand – and later on had to dedicate time to improving its speed (again, for the speed-test-suite-induced demand) – all instead of working on more important things like bug fixes.
In Firefox, there’s been a lot of analysis of different speed test suites (including the one produced by yours truly) and attempts to figure out which tests are actually meaningful. One such test looked at sorting an array of integers using .sort(). This test, however, was fundamentally flawed. Observe:
[js]var a = [];
for (var i = 0; i < 6; i++) a.push(parseInt(1000 * Math.random()));
a.sort(); // => [1, 11, 3, 5, 7, 9][/js]
Notice the problem? This test was using the default .sort() method, which sorts the contents of the array as if they were strings. This is a nonsense test. There is virtually no use case for performance testing of comparing-integers-as-strings. And yet there it is, confusing users and wasting developer time.
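To make the difference concrete, here is a minimal sketch that sorts the same array both ways. The fixed input is just the example output from above, shuffled, so the result is reproducible. The names values, asStrings, and asNumbers are made up for illustration.

```javascript
// Fixed input instead of Math.random(), so the output is reproducible.
var values = [9, 1, 11, 7, 3, 5];

// Default sort: elements are converted to strings and compared lexicographically.
var asStrings = values.slice().sort();

// Numeric sort: the comparator returns a negative, zero, or positive number.
var asNumbers = values.slice().sort(function (a, b) {
  return a - b;
});

console.log(asStrings); // [1, 11, 3, 5, 7, 9]
console.log(asNumbers); // [1, 3, 5, 7, 9, 11]
```

A test that wants to measure integer sorting would presumably need the second form; the first measures string comparison.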
Now, it’s really easy to test stuff like sorting and looping – that requires no user interaction whatsoever. But how do you objectively test the speed of things like “the user clicks this button and this div appears” or “how smooth is the animation of this div from point A to point B”? It isn’t completely clear how to make this happen – especially in an unbiased way. For example, different forms of event triggering could be used to simulate user clicks, but do those properly simulate an actual user click? Do they happen faster? Slower? What if a browser’s event trigger system is quite slow but its normal UI experience is excellent? It’s not clear what the right answer is, and no one has solved it yet. This is something that I hope to be looking into in the upcoming weeks and months.
The problem really boils down to a matter of “microtests” in comparison to “real world” tests. A microtest effectively runs a single test hundreds, or thousands, of times to try to get a good statistical result for analysis. This is ripe for error, cheating, and general infeasibility. When was the last time you did document.getElementById("test") back-to-back 500 times, or $("#test") for that matter? You didn’t. Any sane person would store the result in a variable and access it again later. The only thing that a test like that does is encourage library, or browser, authors to optimize for unrealistic tests. Is the overhead of a caching system worth it if, in reality, there are virtually no cache hits? It doesn’t matter, though, since competitive testing can lead people to implement unnecessary systems like these, purely for the sake of stats.
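Here is a rough sketch of the gap between what such a microtest exercises and what ordinary code does. So that it runs without a DOM, getById is a hypothetical stand-in for document.getElementById that only counts how often it is called. The elements object and the counter are assumptions made for the illustration.

```javascript
// Hypothetical stand-in for document.getElementById, so this runs without a DOM.
var lookups = 0;
var elements = { test: { id: "test", className: "" } };
function getById(id) {
  lookups++;
  return elements[id];
}

// What the microtest measures: the same lookup, back-to-back.
for (var i = 0; i < 500; i++) {
  getById("test");
}
var microtestLookups = lookups;

// What real code tends to do: look up once, then reuse the reference.
lookups = 0;
var el = getById("test");
for (var j = 0; j < 500; j++) {
  el.className = "step-" + j;
}
var realWorldLookups = lookups;

console.log(microtestLookups, realWorldLookups); // 500 1
```

A cache behind getById would be hit 499 times in the first loop and never in the second, which is the mismatch described above.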
Now microtests do have their place – but that place should not be one of public competition but of personal introspection. The first step to competitive performance analysis should always focus on the users. It should be all about what the users are actually doing in their day-to-day browsing. Only after you’ve identified problem tests that you’d like to improve do you move to microtests. Since they don’t serve as a good basis for public comparison (leading to unrealistic cheating, etc.) it’s best to keep tests like those internal. Once you can do that they can become quite useful. User clicks a link and the animation is choppy? How’s our timer performance? CSS property manipulation? Closure speed? DOM element accesses? All of these are the type of things that microtests were designed for – looking into the root cause of the real world problems.
In summary, there are three things that I think are really important about performance testing:
Use competition to light a fire. Competitive performance testing has its place. Personally, seeing poor performance results for my code makes me angry – which is good, because it gets me excited about improving the speed of my code. This is good as, theoretically, the users will benefit in the end.
Test real world code. Competitive, or even public, testing should strive to analyze real world code. Simulating a user’s natural experience as closely as possible should be the ultimate goal. It’s only when you get to this point that you no longer cheat yourself, others, and – especially – your users.
Analyze performance against yourself. You should be your greatest enemy. Remember that last release of your code? It should be your goal to make it look as bad as possible. In the end your users will be able to see actual results and competition will be healthy (no one gets hurt) and clean (there’s no incentive to cheat).
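As a sketch of that last point, here is one way to pit a previous release against the current one. Everything in it is hypothetical: buildV1 and buildV2 stand in for two versions of the same function, timeIt is a made-up helper, and the input would ideally come from a real page rather than a hand-written array.

```javascript
// Hypothetical "previous release": builds a class string by concatenation.
function buildV1(names) {
  var out = "";
  for (var i = 0; i < names.length; i++) {
    out += (i ? " " : "") + names[i];
  }
  return out;
}

// Hypothetical "current release": same contract, different implementation.
function buildV2(names) {
  return names.join(" ");
}

// Time `runs` calls of fn(input) and return the elapsed milliseconds.
function timeIt(fn, input, runs) {
  var start = Date.now();
  for (var i = 0; i < runs; i++) {
    fn(input);
  }
  return Date.now() - start;
}

var input = ["nav", "active", "wide", "dark"];

// Check that both versions agree before comparing their speed.
var sameResult = buildV1(input) === buildV2(input);
var v1Ms = timeIt(buildV1, input, 100000);
var v2Ms = timeIt(buildV2, input, 100000);

console.log("same result:", sameResult, "v1:", v1Ms + "ms", "v2:", v2Ms + "ms");
```

The numbers will vary by browser and machine. What matters is that the comparison is between your own releases, on input that your users actually produce.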