OCR and Neural Nets in JavaScript


A pretty amazing piece of JavaScript dropped yesterday and it’s going to take a little bit to digest it all. It’s a GreaseMonkey script, written by ‘Shaun Friedle‘, that automatically solves captchas provided by the site Megaupload. There’s a demo online if you wish to give it a spin.

Now, the captchas provided by the site aren’t very “hard” to solve (in fact, they’re downright bad – some examples are below):

But there are many interesting parts here:

  1. The HTML 5 Canvas getImageData API is used to get at the pixel data from the Captcha image. Canvas gives you the ability to embed an image into a canvas (from which you can later extract the pixel data back out again).
  2. The script includes an implementation of a neural network, written in pure JavaScript.
  3. The pixel data, extracted from the image using Canvas, is fed into the neural network in an attempt to divine the exact characters being used – in a sort of crude form of Optical Character Recognition (OCR).

If we crack open the source code we can see how it works. A lot of it comes down to how the captcha is implemented. As I mentioned before it’s not a very good captcha. It has 3 letters, each in a separate color, using a possible 26 letters, and they’re all in the same font.

The first step is pretty clear: The captcha is copied into the canvas and then converted to grayscale.

  1. function convert_grey(image_data){
  2.   for (var x = 0; x < image_data.width; x++){
  3.     for (var y = 0; y < image_data.height; y++){
  4.       var i = x*4+y*4*image_data.width;
  5.       var luma = Math.floor(image_data.data[i] * 299/1000 +
  6.         image_data.data[i+1] * 587/1000 +
  7.         image_data.data[i+2] * 114/1000);
  8.  
  9.       image_data.data[i] = luma;
  10.       image_data.data[i+1] = luma;
  11.       image_data.data[i+2] = luma;
  12.       image_data.data[i+3] = 255;
  13.     }
  14.   }
  15. }

The canvas is then broken apart into three separate pixel matrices – each containing an individual character (this is quite easy to do – since each character is a separate color, they’re broken apart just based upon the different colors used).

  1. filter(image_data[0], 105);
  2. filter(image_data[1], 120);
  3. filter(image_data[2], 135);
  1. function filter(image_data, colour){
  2.   for (var x = 0; x < image_data.width; x++){
  3.     for (var y = 0; y < image_data.height; y++){
  4.       var i = x*4+y*4*image_data.width;
  5.  
  6.       // Turn all the pixels of the certain colour to white
  7.       if (image_data.data[i] == colour) {
  8.         image_data.data[i] = 255;
  9.         image_data.data[i+1] = 255;
  10.         image_data.data[i+2] = 255;
  11.      
  12.       // Everything else to black
  13.       } else {
  14.         image_data.data[i] = 0;
  15.         image_data.data[i+1] = 0;
  16.         image_data.data[i+2] = 0;
  17.       }
  18.     }
  19.   }
  20. }

Finally any extraneous noisy pixels are removed from the image (providing a clear character). This is done by looking for white pixels (ones that’ve been matched) that are surrounded (above and below) by black, un-matched, pixels. If that’s the case then the matching pixel is simply removed.

  1. var i = x*4+y*4*image_data.width;
  2. var above = x*4+(y-1)*4*image_data.width;
  3. var below = x*4+(y+1)*4*image_data.width;
  4.  
  5. if (image_data.data[i] == 255 &&
  6.     image_data.data[above] == 0 &&
  7.     image_data.data[below] == 0)  {
  8.   image_data.data[i] = 0;
  9.   image_data.data[i+1] = 0;
  10.   image_data.data[i+2] = 0;
  11. }

We’re getting really close to having a shape that we can feed into the neural network, but it’s not completely there yet. The script then goes on to do some very crude edge detection on the shape. The script looks for the top, left, right, and bottom-most pixels in the shape and turns it into a rectangle – and converts that shape back into a 20 by 25 pixel matrix.

  1. cropped_canvas.getContext("2d").fillRect(0, 0, 20, 25);
  2. var edges = find_edges(image_data[i]);
  3. cropped_canvas.getContext("2d").drawImage(canvas, edges[0], edges[1],
  4.   edges[2]-edges[0], edges[3]-edges[1], 0, 0,
  5.   edges[2]-edges[0], edges[3]-edges[1]);
  6.  
  7. image_data[i] = cropped_canvas.getContext("2d").getImageData(0, 0,
  8.   cropped_canvas.width, cropped_canvas.height);

So – after all this work, what do we have? A 20 by 25 matrix containing a single rectangle, drawn in black and white. Terribly exciting.

That rectangle is then reduced even further. A number of strategically-chosen points are then extracted from the matrix in the form of “receptors” (these will feed the neural network). For example a receptor might be to look at the pixel at position 9×6 and see if it’s “on” or not. A whole series of these states are computed (much less than the full 20×25 grid – a mere 64 states) and fed into the neural network.

The question that you should be asking yourself now is: Why not just do a straight pixel comparison? Why all this mess with the neural network? Well, the problem is, with all of reduction of information a lot ambiguity exists. If you run the online demo of this script you’re more likely to find the occasional failure from the straight pixel comparison than from running it through the network. That being said, for most users, a straight pixel comparison would probably be sufficient.

The next step is attempting to guess the letter. The network is being fed with 64 boolean inputs (collected from one of the extracted letters) along with another series of pre-computed values. One of the concepts behind how a neural network works is that you pre-seed it with some of the results from a previous run. It’s likely that the author of this script simply ran it again and again and collected a whole series of values to get an optimal score. The score itself may not have any particular meaning (other than to the neural network itself) but it helps to derive the value.

When the neural net is run it takes the 64 values that’ve been computed from one of the characters in the captcha and compares it against a single pre-computed letter of the alphabet. It continues in the manner assigning a score for each letter of the alphabet (a final result might be ‘A 98% likely’, ‘B 36% likely’, etc.).

Going through the three letters in the captcha the final result is devised. It’s not 100% perfect (I wonder if better scores would be achieved if the letter wasn’t turned into a featureless rectangle before all these computations) but it’s pretty good for what it is – and pretty amazing considering that it’s all happening 100% in the browser using standards-based technology.

As a note – what’s happening here is rather instance-specific. This technique *might* be able to work on a few more poorly-constructed captchas, but beyond that the complexity of most captchas just becomes too great (especially so for any client-side analysis).

I’m absolutely expecting some interesting work to be derived from this project – it holds a lot of potential.

Posted: January 23rd, 2009


If you particularly enjoy my work, I appreciate donations given with Gittip.

47 Comments (Show Comments)



Comments are closed.
Comments are automatically turned off two weeks after the original post. If you have a question concerning the content of this post, please feel free to contact me.


Secrets of the JavaScript Ninja

Secrets of the JS Ninja

Secret techniques of top JavaScript programmers. Published by Manning.

Ukiyo-e Database and Search

Ukiyo-e.org

Japanese woodblock print database and search engine.


John Resig Twitter Updates

@jeresig

Infrequent, short, updates and links.


via Ad Packs