Captcha: The Link Between Robots and Archives

This summer I had the pleasure of spending a weekend on a houseboat. The “RV of the Sea,” my friends lovingly called it. My fellow shipmates included several friends of friends who were new to me. In these types of situations, I am always interested to see the reaction to my explanation of what I do for a living. The reactions vary from polite indifference to genuine interest and sometimes even fascination (these are my favourite people, of course). Typically, any conversation that ensues focuses on the preservation of either the very old or the very new, i.e. digital records in an ever-changing world of technological advancement.

However, on this particular occasion, I found myself in a fascinating conversation about – of all things – captcha screens. You know those windows that pop up and make you prove that you are not a robot?

A standard Captcha screen

Well, one of my RV of the Sea companions told me something I didn’t know – that captchas (“Completely Automated Public Turing Test to Tell Computers and Humans Apart” – catchy, eh?) are used as a way of outsourcing the verification and correction of digitized manuscripts!


“Buy a ticket to the ballgame, help preserve history”

So obviously, once I got back to land and a reliable Wi-Fi signal, I did a little digging to find out more about this. How did I not know about this before? In 2011, Guy Gugliotta wrote a great piece in the New York Times about how buying a ticket to a ball game meant that you were assisting in a project to transcribe “an old book, magazine, newspaper or pamphlet into an accurate, searchable and easily sortable computer text file.” I’ll forgive Mr. Gugliotta for using the clichéd journalist-writes-about-archives terminology (“old, musty text”) because he explained to the standard web user how he or she has helped to preserve historical knowledge.

A team of researchers at Carnegie Mellon University developed a suite of tools called reCaptcha and piloted them by addressing OCR (optical character recognition) mistakes in the digitized issues of the New York Times. Google then took this technology on board in order to verify the full texts it was digitizing for the Google Books project. OCR technology has historically been terrible at deciphering hand-written text and, as a result, digitization projects required a tonne of human intervention to re-type anything the program misunderstood.

Once a text has been scanned using two different OCR programs, the reCaptcha tools flag any disagreements between the two interpretations. As Gugliotta explains, “then each suspicious word is turned into a Captcha…the unknown word is then paired with a second Captcha word whose correct translation is already known.” The unknown word and the control word are shown to multiple users, who are asked to decipher both of them in exchange for their Disney on Ice tickets. OK, well, they still have to pay money for the tickets, but first they have to prove they aren’t a robot in order to be given the privilege.
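For the curious, here is a rough sketch in Python of the process described above. It is only my illustration of the idea, not the actual reCaptcha code: the function names, the word-for-word alignment of the two OCR outputs, and the three-vote threshold are all assumptions I made up for the example.

```python
from collections import Counter


def flag_suspicious(ocr_a, ocr_b):
    """Return the positions where two OCR readings of the same text disagree.

    Assumes (for illustration) that both readings are lists of words that
    line up one-to-one.
    """
    return [i for i, (a, b) in enumerate(zip(ocr_a, ocr_b)) if a != b]


def resolve_word(responses, control_answer, min_votes=3):
    """Choose a transcription for an unknown word from user responses.

    Each response is a (control_guess, unknown_guess) pair. A guess for the
    unknown word only counts if that user typed the control word correctly.
    min_votes is a hypothetical threshold, not a documented value.
    """
    votes = Counter(
        unknown.lower()
        for control, unknown in responses
        if control.lower() == control_answer.lower()
    )
    if not votes:
        return None  # nobody passed the control word
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Two OCR programs read the same line and disagree on the third word.
reading_a = "the quick brovvn fox".split()
reading_b = "the quick hrown fox".split()
suspicious = flag_suspicious(reading_a, reading_b)

# Five people see the control word "morning" next to the unknown word.
responses = [
    ("morning", "brown"),
    ("morning", "brown"),
    ("mornin", "brawn"),   # failed the control word, so this guess is ignored
    ("morning", "brown"),
    ("morning", "brovvn"),
]
answer = resolve_word(responses, "morning")
```

The control word does double duty here: it is the part that proves you are human, and it is also the reason your reading of the unknown word can be trusted enough to count as a vote.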


“A computer would deserve to be called intelligent if it could deceive a human into believing that it was human” – Alan Turing

We all know that technology is constantly shifting and improving, responding to current realities. Over the years, the reCaptcha service has had to evolve as bots have grown steadily more capable of deciphering distorted text. Google’s security blog reports that the most sophisticated bots can now decipher distorted text with 99.8% accuracy. In response, Google has upped its bot-proofing game, which has changed the look and feel of the captchas we all know and love.

Now you simply tick a checkbox to prove you have an organic brain. In contrast with the captchas of yore, the service recognizes straightforward cases and asks for no further verification. However, in some cases, the service determines that additional proof is necessary. In those cases, you will still be shown snippets of books, maps, and other documents and will help to increase the world’s digital knowledge. Or you will get to look at pictures of cute animals. So I guess it’s a win either way!

The new captcha technology from the reCaptcha suite of tools

“Guess what? I’m not a robot” 

You never know what you are going to learn aboard a floating house. I think the reCaptcha project is a wonderful example of what can be done when ingenuity and the desire to preserve knowledge are combined.

I will leave you with this great song by Marina and the Diamonds because I can only assume it was inspired by her love of captchas.