Dictionary Lookups in JavaScript


I’ve been working on a browser-based word game, naturally written in JavaScript, and have been encountering some interesting technical challenges along the way. I’ve written up my thought process here for others to learn from (note that most of this happened over the course of a month, or so). I’ve often found that while a final solution to a problem may be rather elegant and “make perfect sense” when looking at it – it’s only through the result of much trial and error that the solution was arrived upon.

To start, in my game, the user is frequently re-arranging letters – causing the game to look up words in a dictionary to see if they are valid, or not.

I’ve taken multiple passes at implementing a solution to this problem, ranging all the way from “I don’t care about performance, I just want it to work” all the way up to “thousands of people could be playing simultaneously, how do I scale?”

For my seed dictionary I used a variation of the Scrabble dictionary, which can be found via some creative Googling. The full dictionary ends up being around 916KB (with words separated by an endline).

Server-Side Solution

The first pass was stupid simple. It worked, but just barely. I took the dictionary and split it up into 26 files – one for each letter of the alphabet – and put all the words that started with the corresponding letter in the text file.

I then made a little PHP script to handle user requests – reading in the portion of the dictionary file and returning “pass” or “fail” if the word was found.

  1. <?php
  2.     # Get the word to be checked from the user
  3.     $word = $_GET['word'];
  4.    
  5.     # Get the first letter of that word
  6.     $first = substr( $word, 0, 1 );
  7.  
  8.     # Open the corresponding file
  9.     $handle = fopen("words/" . $first . ".txt", "r");
  10.     if ( $handle ) {
  11.         # Keep going until the end of the file
  12.         while (!feof($handle)) {
  13.             # Get the word in the dictionary
  14.             # (removing the endline)
  15.             $line = trim(fgets($handle));
  16.            
  17.             # And see if the word matches
  18.             if ( $line == $word ) {
  19.                 # If so return "pass" to the client
  20.                 echo "pass";
  21.                 exit();
  22.             }
  23.         }
  24.         fclose($handle);
  25.     }
  26.  
  27.     # If we made it to the end, then we failed
  28.     echo "fail";
  29. ?>

(Please don’t use the above code, it’s terrible.)

Thus you would call the PHP script like so:

/words.php?word=test

And it would return either “pass” or “fail” (meaning that it would only work on the same domain).

So while the above worked it wasn’t nearly fast enough and it consumed a ton of memory on the server – each request was reading in large portions of potentially 100KB+ files. Time for a better solution.

A Better Server-Side Solution

My next attempt was to write a simple little web server (in Perl, this time) that pre-loaded the entire dictionary into memory, and handled all the requests via JSONP.

  1. #!/usr/bin/perl
  2.  
  3. use CGI;
  4.  
  5. # Read the dictionary into the word hash
  6. my %words = map { $_, 1 } split(/\n/, `cat dict/dict.txt`);
  7.  
  8. # Instantiate and start the web server
  9. use base qw(Net::Server::HTTP);
  10. __PACKAGE__->run( port => 8338 );
  11.  
  12. sub process_http_request {
  13.     # Make a CGI object for the request
  14.     my $cgi = CGI->new;
  15.  
  16.     # Get the word from the user
  17.     my $word = $cgi->param( "word" );
  18.     $word =~ s/\W//g;
  19.  
  20.     # And the callback name
  21.     my $callback = $cgi->param( "callback" );
  22.     $callback =~ s/\W//g;
  23.  
  24.     print "Content-type: text/javascript\n\n";
  25.  
  26.     if ( $word && $callback ) {
  27.         # Dump back the results in a JSONP format
  28.         print $callback . '({"word":"' . $word .
  29.             '","pass":' . (defined $words{ $word } ?
  30.                 'true' : 'false') . '})';
  31.     }
  32. }

You would call the above using something like this:

http://example.com:8338/?word=test&callback=wordFound

There are a bunch of things that I don’t like about the above code (like using a hash to lookup the words, when more memory-efficient solutions exist, and instantiating a CGI object on every request – things like that). Also if I were to do this today I’d probably write it in Node.js, which I’m becoming more familiar with.

Compared with the previous solution though there are a large number of advantages. Since the dictionary is being loaded into memory only once, and into a hash, it makes lookups very, very, fast. I was timing entire HTTP requests at only a handful of milliseconds, which is great. Additionally since this solution utilized JSONP it was possible to set up a server (or multiple servers) dedicated to looking up words and have the clients connect to them cross-domain.

A Client-Side Solution

It was at this point that I realized a couple problems with the previous server-side solutions. If my game was going to work offline (or as a distributable mobile application) then a constant connection to a dictionary server wasn’t going to be possible. (And if slow mobile connections were taken into account then the responsiveness of the server just wasn’t going to be sufficient enough.)

The dictionary was going to have to live on the client.

Thus I wrote a simple Ajax request to load the text dictionary and make a lookup object for later use.

  1. // The dictionary lookup object
  2. var dict = {};
  3.  
  4. // Do a jQuery Ajax request for the text dictionary
  5. $.get( "dict/dict.txt", function( txt ) {
  6.     // Get an array of all the words
  7.     var words = txt.split( "\n" );
  8.  
  9.     // And add them as properties to the dictionary lookup
  10.     // This will allow for fast lookups later
  11.     for ( var i = 0; i < words.length; i++ ) {
  12.         dict[ words[i] ] = true;
  13.     }
  14.    
  15.     // The game would start after the dictionary was loaded
  16.     // startGame();
  17. });
  18.  
  19. // Takes in an array of letters and finds the longest
  20. // possible word at the front of the letters
  21. function findWord( letters ) {
  22.     // Clone the array for manipulation
  23.     var curLetters = letters.slice( 0 ), word = "";
  24.    
  25.     // Make sure the word is at least 3 letters long
  26.     while ( curLetters.length > 2 ) {
  27.         // Get a word out of the existing letters
  28.         word = curLetters.join("");
  29.    
  30.         // And see if it's in the dictionary
  31.         if ( dict[ word ] ) {
  32.             // If it is, return that word
  33.             return word;
  34.         }
  35.  
  36.         // Otherwise remove another letter from the end
  37.         curLetters.pop();
  38.     }
  39. }

Note that having the dictionary on the client actually allowed for an interesting new form of game play. Previously the player had to select a specific word and send it to the server – and only then would the server say if that word was valid or not. Since the dictionary now lives on the client (making lookups instantaneous, in comparison) we can change the logic a bit: The game now looks for the longest word at the start of the user’s letters. For example, if the user had the letters “rategk” then the function would work like so:

  1. findWord( [ "r", "a", "t", "e", "g", "k" ] )
  2. // => returns "rate"
  3.  
  4. findWord( [ "k", "t", "a", "g", "e", "k" ] )
  5. // => returns undefined

Naturally, something similar could’ve been done on the server-side – but that wasn’t readily apparent when working on the server. This is a case where a performance optimization actually created a new, emergent, form of gameplay.

But here’s the rub: We’re now sending a massive dictionary down to a client. It’s 916KB – and that’s absolutely massive.

Optimizing the Client-Side Solution

Compression and Caching

Turning on Gzip compression on the server reduces the dictionary file size from 916KB to a much-more-sane 276KB. Additionally configuring the cache-control settings of your server will ensure that the dictionary won’t be requested again for a very long time (assuming that the cache in the browser isn’t cleared).

There are some excellent articles already written on these techniques:

Content Delivery Networks

Of course, all of this is a bit of a given when you use a Content Delivery Network – which I most certainly do. Right now I’m using Amazon Cloudfront, due to its relatively simple API, but I’m open to other solutions. This means that the dictionary file will be positioned on a large number of servers around the globe and served to the user in the fastest manner possible (using both gzip and proper cache headers).

Cross-Domain Requests

There’s a problem, though: We can’t load our dictionary from a CDN! Since the CDN is located on another server (or on another sub-domain, as is the case here) we’re at the mercy of the browser’s cross-origin policy prohibiting those types of requests. All is not lost though – with a simple tweak to the dictionary file we can load it across domains.

First, we replace all endlines in the dictionary file with a space. Second, we wrap the entire line with a JSONP statement. Thus the final result looks something like this:

dictLoaded('aah aahed aahing aahs aal... zyzzyvas zzz');

This allows us to do an Ajax request for the file and have it work as would expected it to – while still benefitting from all the caching and compression provided by the browser.

I alluded to another problem already: If the browser doesn’t cache the file, for some reason, or if the cache runs out of space and the file is expunged – it’ll be downloaded again. I want to try and reduce the number of times in which 216KB file will be downloaded – if not for the users then for reducing my bandwidth bill.

Local Storage

This is where we can use a great feature of HTML 5: Local Storage. Mark Pilgrim has a great tutorial written up on the subject that I recommend to all. In a crude nutshell: You now have an object that you can stuff strings into and they’ll be persisted by the browser. Most browsers give you around 5MB to play with – which is more than enough for our dictionary file. It also has great cross-browser support – with the simple API working across all modern browsers.

With a little tweak to our Ajax logic (taking into account the JSONP request, the CDN, and Local Storage) we now end up with a revised solution:

  1. // See if the property that we want is pre-cached in the localStorage
  2. if ( window.localStorage !== null && window.localStorage.gameDict ) {
  3.     dictReady( window.localStorage.gameDict );
  4.  
  5. // Load in the dictionary from the server
  6. } else {
  7.     jQuery.ajax({
  8.         url: cdnHREF + "dict/dict.js",
  9.         dataType: "jsonp",
  10.         jsonp: false,
  11.         jsonpCallback: "dictLoaded",
  12.         success: function( txt ) {
  13.             // Cache the dictionary, if possible
  14.             if ( window.localStorage !== null ) {
  15.                 window.localStorage.gameDict = txt;
  16.             }
  17.  
  18.             // Let the rest of the game know
  19.             // that the dictionary is ready
  20.             dictReady( txt );
  21.         }
  22.         // TODO: Add error/timeout handling
  23.     });
  24. }

This gives us an incredibly efficient solution, allowing us to load the dictionary from a CDN (gzipped and with proper cache headers) – and avoiding subsequent requests to get the file if we already have it cached.

Improving Memory Usage

One final tweak that we can make. If you remember the previous dictionary lookup we loaded the entire dictionary into an object and then checked to see if a specific property existed. While this works, and is fast, it also ends up consuming a lot of memory (more so than the existing 916KB, at least).

To avoid this we can be a little bit tricky. Instead of putting the words into an object we can just leave the entire dictionary string intact and then do searches using a JavaScript String’s indexOf method.

Thus inside the dictReady callback we’ll have something like the following:

  1. function dictReady( txt ) {
  2.     dict = " " + txt + " ";
  3. }

and our revised findWord function will look something like this:

  1. // Finding and extracting words
  2. function findWord( letters ) {
  3.     // Copy all the tiles
  4.     var curRack = letters.slice( 0 ), word = "";
  5.  
  6.     // We're going to keep going through the available letters
  7.     // looking for a long-enough word
  8.     while ( curRack.length >= 3 ) {
  9.         // Find a "word" from the available tiles
  10.         word = curRack.join("");
  11.        
  12.         // ... and see if it's in the dictionary
  13.         if ( dict.indexOf( " " + word + " " ) >= 0 ) {
  14.             // We've found the word and we can stop
  15.             return word;
  16.         }
  17.  
  18.         // ... otherwise we need to remove the last letter
  19.         curRack.pop();
  20.     }
  21. }

All together, as of this moment, this is the most optimal solution to this particular problem that I can think of. It’s likely that I’ll find some additional tweaks (or ways of improving memory usage) in the future – but at the very least this solution keeps HTTP requests to a minimum, bandwidth usage to a minimum, memory usage to a minimum, and lookups fast.

Posted: March 15th, 2011


If you particularly enjoy my work, I appreciate donations given with Gittip.

59 Comments (Show Comments)



Comments are closed.
Comments are automatically turned off two weeks after the original post. If you have a question concerning the content of this post, please feel free to contact me.


Secrets of the JavaScript Ninja

Secrets of the JS Ninja

Secret techniques of top JavaScript programmers. Published by Manning.

Ukiyo-e Database and Search

Ukiyo-e.org

Japanese woodblock print database and search engine.


John Resig Twitter Updates

@jeresig

Infrequent, short, updates and links.