Revised JavaScript Dictionary Search


After my two previous posts discussing dictionary lookups in JavaScript and JavaScript Trie performance analysis even more excellent feedback came in from everyone.

Out of all the results two techniques seemed to be most interesting – and promising for reducing general memory usage and load time.

String-based Binary Search

The first technique proposed was left as a comment by a commenter named Randall. He utilized a technique posited by Mathias Nater to create a smaller dictionary and search the results using a binary search.

The technique breaks down into two parts: The smaller dictionary and the binary search.

The smaller dictionary is achieved by taking the regular dictionary and grouping the words by length. Thus you’ll end up with clusters of words that are all equal lengths. For example:

  1. [
  2.   "catdoghat",    // all length 3
  3.   "foodridewood", // all length 4
  4.   ... etc.
  5. ]

If all the words in a particular string are the same length then it makes it quite easy to jump around inside it (you just always move forwards or backwards in the string one word’s length). This reduces the resulting, uncompressed, file size of the dictionary by about 110,000 bytes (there is now one less space for each word in the dictionary).

Having the words of uniform length makes it easy to implement a binary search. The code written by Randall is rather simple and effective:

  1. function findBinaryWord( word ) {
  2.   // Figure out which bin we're going to search
  3.     var l = word.length;
  4.    
  5.     // Don't search if there's nothing to look through
  6.     if ( !dict[l] ) {
  7.         return false;
  8.     }
  9.    
  10.     // Get the number of words in the dictionary bin
  11.     var words = dict[l].length / l,
  12.         // The low point from where we're starting the binary search
  13.         low = 0,
  14.        
  15.         // The max high point
  16.         high = words - 1,
  17.        
  18.         // And the precise middle of the search
  19.         mid = Math.floor( words / 2 );
  20.    
  21.     // We continue to look until we reach a final word
  22.     while ( high >= low ) {
  23.         // Grab the word at our current position
  24.         var found = dict[l].substr( l * mid, l );
  25.        
  26.         // If we've found the word, stop now
  27.         if ( word === found ) {
  28.             return true;
  29.         }
  30.        
  31.         // Otherwise, compare
  32.         // If we're too high, move lower
  33.         if ( word < found ) {
  34.             high = mid - 1;
  35.        
  36.         // If we're too low, go higher
  37.         } else {
  38.             low = mid + 1;
  39.         }
  40.        
  41.         // And find the new search point
  42.         mid = Math.floor( (low + high) / 2 );
  43.     }
  44.    
  45.     // Nothing was found
  46.     return false;
  47. }

Doing a binary search will almost certainly yield a faster result compared to a simple linear search through the string (such as doing a .indexOf).

Trie Stored in a Succinct Data Structure

The other solution that popped up was proposed by Steve Hanov. His solution was to continue utilizing the Trie structure but store it in a succinct data structure. He explains it extremely well so I would highly recommend visiting his blog to get the full details of the solution.

In short all the leaves of the Trie are given a number and stored in a table (resulting in a data structure for lookups and a data structure for the actual data itself). Additionally he went a step further and encoded all the data using Base 64 and modified the algorithm to utilize that encoding – thus all the encoded data can remain as a string and never have to be turned into a large JavaScript data structure.

Ideally this solution will provide a much smaller file size and memory usage.

File Size Analysis

We still want to make sure that the file size of the dictionary isn’t too large – saving us some potential bandwidth costs. Digging in to the file size results yields some very interesting data.

Revised Dictionary File Size

Unsurprisingly the uncompressed binary string is smaller than the plain string – but quite surprisingly the final gzipped size actually ends up being larger than the original. I can only assume that the spacing in the original ended up helping the gzip algorithm compress the resulting text easier – definitely not an intuitive result, though.

Additionally when we look at the results for the succinct Trie we do see that it’s uncompressed file size is quite small (only about 300KB). Amusingly, its compressed file size ends up being larger than any of the previous Trie structures – but still smaller than the regular strings.

If we were only concerned with file size then the suffix optimized Trie would still be the way to go.

Load Speed Analysis

As with before, analyzing the initial load speed of the dictionary ends up being crucial. If we’re utilizing a JavaScript object we have to evaluate it first – and that can end up taking a considerable amount of time.

Revised Dictionary Load Speed

Thankfully both of the new techniques load virtually instantaneously (taking only 0.65 – 1.7ms to load the whole dictionary, which makes sense, since they’re both coming from simple strings and require no code evaluation).

Search Speed Analysis

As I mentioned in my last post, the performance of word lookups wasn’t nearly as important in my particular use case as all the other considerations. Having a lookup take a couple milliseconds was OK since they were only going to occur every couple seconds or so. This may not be the case for every application – and I still wanted to make sure that the code wasn’t going to take any major performance hits.

Revised Dictionary Search Speed

The binary string search ends up being quite fast – even faster than the old Trie search, which is quite promising. However the new Succinct Trie ends up being disturbingly slow (taking about 5.5ms to look up a word in Node.js on a fast machine). I double-checked and the lookup time for finding a word on an iPhone 4 using iOS 4.3 is about 46ms. Even taking that into account I think that’s still a lookup rate that I can live with in my particular application. However it most certainly will be too slow for many situations (and in those cases they’ll likely want to use the Binary String technique).

It’s not clear to me how much of the Succinct Trie’s performance stems from its implementation. Perhaps there could be some improvements made to it to increase its final word search time.

Memory Usage Analysis

I finally revisited memory usage, as well. This continues to be critical on mobile devices. Using too much memory will cause all sorts of problems and frequently force the browser to reload your application.

Revised Dictionary Memory Usage

The Binary String technique does well. Being a simple string (and of a smaller total size) means that it uses less memory as well. However the Succinct Trie technique does absolutely stellar – as it’s a string as well, and since it’s total uncompressed size is only about 1/3 of a MB, that’s all the memory that it ends up using. This is fantastic and certainly a huge encouragement towards using this technique.

Conclusion

Even though it’s not quite as small (compressed file-size-wise) as a regular Trie and even though its lookup speed is significantly slower than any other technique (but still usable) – I’m heavily inclined to use the Succinctly Stored Trie Structure. I would definitely like to encourage everyone to look at this particular technique further as, currently, it provides excellent memory usage, reasonably-small file size, and virtually non-existent startup overhead. If you’re OK with having relatively slow word lookups then this is certainly the technique for you.

If slow lookup speed is undesirable then I would be inclined to use the binary string search technique (as it still provides decent memory usage, reasonable file size, and reasonable memory usage).

As always, all the code and tests are up on Github:
https://github.com/jeresig/trie-js

Raw Data

The raw data for the updated tests (which can be found in my Trie.js Github repo) are as follows:

Binary String
--------------
Normal: 806KB
Gzipped: 326KB
Build Speed: 174 (1.74ms per)
Find Speed: 369 (0.0033ms per)
Not Find Speed: 197 (0.0012ms per)
Private Memory: 112.2MB (0.98MB per)

Succinct Trie
--------------
Normal: 317KB
Gzipped: 197KB
Build Speed: 65 (0.65ms per)
Find Speed: 609985 (5.43ms per)
Not Find Speed: 622700 (5.54ms per)
Private Memory: 47.1MB (0.33MB per)

Posted: March 22nd, 2011


If you particularly enjoy my work, I appreciate donations given with Gittip.

24 Comments (Show Comments)



Comments are closed.
Comments are automatically turned off two weeks after the original post. If you have a question concerning the content of this post, please feel free to contact me.


Secrets of the JavaScript Ninja

Secrets of the JS Ninja

Secret techniques of top JavaScript programmers. Published by Manning.

Ukiyo-e Database and Search

Ukiyo-e.org

Japanese woodblock print database and search engine.


John Resig Twitter Updates

@jeresig

Infrequent, short, updates and links.