CAPTCHA

A modern CAPTCHA. Rather than attempting to create a distorted background and high levels of warping on the text, this CAPTCHA focuses on making segmentation difficult by adding an angled line.

A CAPTCHA is a type of challenge-response test used in computing to determine whether the user is human. "CAPTCHA" is an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart", trademarked by Carnegie Mellon University. A CAPTCHA involves one computer (a server) which asks a user to complete a test. While the computer is able to generate and grade the test, it is not able to solve the test on its own. Because computers are unable to solve the CAPTCHA, any user entering a correct solution is presumed to be human. The term CAPTCHA was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper (all of Carnegie Mellon University), and John Langford (of IBM). A common type of CAPTCHA requires that the user type the letters of a distorted image, sometimes with the addition of an obscured sequence of letters or digits that appears on the screen. Because the test is administered by a computer, in contrast to the standard Turing test that is administered by a human, a CAPTCHA is sometimes described as a reverse Turing test.

Currently, reCAPTCHA is recommended by the CAPTCHA creators as an official CAPTCHA implementation.^[2]

Origin

The first discussion of automated tests which distinguish humans from computers for the purpose of controlling access to web services appears in a 1996 manuscript of Moni Naor from the Weizmann Institute of Science, entitled "Verification of a human in the loop, or Identification via the Turing Test". Primitive CAPTCHAs seem to have been later developed in 1997 at AltaVista by Andrei Broder and his colleagues to prevent bots from adding URLs to their search engine. Looking for a way to make their images resistant to OCR (Optical Character Recognition) attack, the team looked at the manual to their scanner, which had recommendations for improving OCR results (similar typefaces, plain backgrounds, etc.). The team created puzzles by attempting to simulate what the manual claimed would cause bad OCR. In 2000, von Ahn and Blum developed and publicized the notion of a CAPTCHA, which included any program that can distinguish humans from computers. They invented multiple examples of CAPTCHAs, including the first CAPTCHAs to be widely used (at Yahoo!).

Applications

CAPTCHAs are used to prevent automated software from performing actions which degrade the quality of service of a given system, whether due to abuse or resource expenditure. Although CAPTCHAs are most often deployed as a response to encroachment by commercial interests, the notion that they exist to stop only spammers is mistaken. CAPTCHAs can be deployed to protect systems vulnerable to e-mail spam, such as the webmail services AOL Webmail, Gmail, Hotmail, and Yahoo. CAPTCHAs have also found active use in stopping automated posting to blogs or forums, whether as a result of commercial promotion, or harassment and vandalism. CAPTCHAs also serve an important function in rate limiting: automated usage of a service may be acceptable until it becomes excessive and harms human users. In such a case, a CAPTCHA can enforce the automated usage policies set by the administrator once certain usage metrics exceed a given threshold.
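The rate-limiting use can be illustrated with a minimal sketch. It assumes a sliding-window request count as the usage metric; the class name CaptchaGate, its parameters, and the example client address are hypothetical and not taken from any particular system.

```python
import time
from collections import defaultdict, deque


class CaptchaGate:
    """Ask for a CAPTCHA once a client exceeds a request threshold in a time window."""

    def __init__(self, max_requests=5, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client id -> timestamps of recent requests

    def needs_captcha(self, client_id, now=None):
        """Record one request and report whether the client is over the threshold."""
        now = time.monotonic() if now is None else now
        recent = self._history[client_id]
        # Discard timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        return len(recent) > self.max_requests


# Hypothetical client making requests at t = 0, 1, 2, 3 and 20 seconds.
gate = CaptchaGate(max_requests=3, window_seconds=10.0)
decisions = [gate.needs_captcha("203.0.113.7", now=t) for t in (0, 1, 2, 3, 20)]
print(decisions)  # [False, False, False, True, False]
```

Only the fourth request, which pushes the client past three requests in ten seconds, triggers a challenge; automated use below the threshold is left alone.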

Characteristics

A CAPTCHA system is an automated means of generating new challenges which current computers are unable to solve accurately, but which most humans can solve.^[3] A CAPTCHA does not rely on the attacker never having seen the given type of CAPTCHA before. For example, although a checkbox labeled "check here if you are not a bot" might serve to distinguish between humans and computers, it is not a CAPTCHA, because it relies on the attacker not having spent effort to break that specific form. To be a CAPTCHA, a system must be able to automatically generate new challenges that require artificial intelligence techniques to solve.

In practice, the algorithm used to create the CAPTCHA does not need to be made public, though it may be covered by a patent. Although publication can help demonstrate that breaking it requires the solution to a difficult problem in the field of artificial intelligence, deliberate withholding of the algorithm can increase the integrity of a limited set of systems (see security through obscurity). The most important factor in deciding whether an algorithm should be made open or restricted is the size of the system. Although an algorithm which survives scrutiny by security experts may be assumed to be more conceptually secure than an unevaluated algorithm, an unevaluated algorithm specific to a very limited set of systems is always of less interest to those engaging in automated abuse. Breaking a CAPTCHA generally requires some effort specific to that particular CAPTCHA implementation, and an abuser may decide that the benefit granted by automated bypass is negated by the effort required to engage in abuse of that system in the first place.

Accessibility

CAPTCHAs based on reading text, or on other visual-perception tasks, prevent blind users from accessing the protected resource and can hinder people with poor vision or color blindness.^[4] People with learning disabilities involving text recognition might also find CAPTCHAs more difficult than the general population does. Because CAPTCHAs are designed to be unreadable by machines, common assistive technology tools such as screen readers cannot interpret them. For this reason, some implementations of CAPTCHAs permit users to opt for an audio CAPTCHA.^[5] Even with a combination of visual and audio challenges, some users will be unable to use a CAPTCHA, for example users with deafblindness.

To make their services fully accessible, some websites allow a manual registration process (e.g., via email). In certain jurisdictions, failing to provide a universally accessible means of bypassing the CAPTCHA could make site owners a target of litigation. For example, a CAPTCHA may make a site incompatible with Section 508 in the United States.

The choice of adding a CAPTCHA to an application is a balance between ease of use for legitimate users and creating enough of a challenge for abusers that abusing the application is not worthwhile. The inconvenience caused by a CAPTCHA is sometimes higher for users with disabilities. For some applications, the potential for abuse is so high that the application author feels that a CAPTCHA is necessary. For other applications, the need for accessibility outweighs the abuse that a CAPTCHA would prevent.

Attempts at accessible CAPTCHAs

There have been various attempts at creating CAPTCHAs that are more accessible. Attempts include the use of JavaScript,^[6] mathematical questions ("what is 1+1"), or "common sense" questions ("what color is the sky"). These attempts violate one or both of the principles of CAPTCHAs: either they cannot be automatically generated or they can be easily cracked given the state of artificial intelligence. As such, the only security these CAPTCHAs provide is security through obscurity; an attacker is unlikely to have encountered the formulation of the CAPTCHA in question, and unlikely to consider it worth spending resources to break the CAPTCHA of a small site.

Circumvention

There are a few approaches to defeating CAPTCHAs: using cheap human labor to recognize them, exploiting bugs in the implementation that allow the attacker to completely bypass the CAPTCHA, and finally improving character recognition software.

Human solvers

CAPTCHAs are vulnerable to a relay attack that uses humans to solve the puzzles. One approach involves relaying the puzzles to a sweatshop of human operators. According to one estimate, the operators could easily solve hundreds of them each hour. If the humans are dedicated employees who receive minimum wage, this is not likely to be viable,^[7] but services like Amazon Mechanical Turk have had success using micropayments to attract human problem-solvers for other tasks. Another variation of this technique involves copying the CAPTCHA images and using them as CAPTCHAs for a high-traffic site owned by the attacker. With enough traffic, the attacker can get a solution to the CAPTCHA puzzle in time to relay it back to the target site.^[8]

Insecure implementation

Design flaws in a CAPTCHA can allow bypass of the security measure, or could make an OCR-based attack easier to mount.

Some CAPTCHA protection systems can be bypassed without using OCR simply by re-using the session ID of a known CAPTCHA image. A correctly designed CAPTCHA does not allow multiple solution attempts at one CAPTCHA. This prevents both the reuse of a correct CAPTCHA solution and a second guess after an incorrect OCR attempt.^[9]
Other implementations pass a hash of the solution (such as an MD5 hash) to the client as the key used to validate the CAPTCHA. Often the CAPTCHA is small enough that this hash can be cracked.^[10] The hash could also assist an OCR-based attempt. A more secure scheme would use an HMAC.
Still other implementations use only a small fixed pool of CAPTCHA images. Once an attacker has collected enough image solutions over time, the CAPTCHA can be broken by simply looking up solutions in a table, keyed on a hash of the challenge image.
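The first two flaws can be avoided together, as the following minimal sketch suggests. It is one possible arrangement rather than a reference design: the names SERVER_KEY, issue_token and verify are hypothetical, the set of used nonces is kept in memory only for brevity, and the example solution "xk7qp" is made up.

```python
import hashlib
import hmac
import secrets
import time

SERVER_KEY = secrets.token_bytes(32)  # hypothetical server-side secret, never sent to the client
_used_nonces = set()                  # challenges already attempted (in memory for this sketch)


def _sign(nonce, expires, solution):
    """HMAC over the challenge id, its expiry and the case-folded solution."""
    message = f"{nonce}:{expires}:{solution.lower()}".encode("utf-8")
    return hmac.new(SERVER_KEY, message, hashlib.sha256).hexdigest()


def issue_token(solution, ttl_seconds=120, now=None):
    """Return the (nonce, expires, tag) triple sent alongside the challenge image."""
    now = time.time() if now is None else now
    nonce = secrets.token_hex(16)
    expires = int(now) + ttl_seconds
    return nonce, expires, _sign(nonce, expires, solution)


def verify(nonce, expires, tag, answer, now=None):
    """Accept at most one attempt per challenge, and only before it expires."""
    now = time.time() if now is None else now
    if nonce in _used_nonces or now > expires:
        return False
    _used_nonces.add(nonce)  # burn the challenge whether or not the answer is right
    return hmac.compare_digest(tag, _sign(nonce, expires, answer))


nonce, expires, tag = issue_token("xk7qp")
first = verify(nonce, expires, tag, "XK7QP")   # correct answer, first attempt
replay = verify(nonce, expires, tag, "XK7QP")  # same challenge submitted again
print(first, replay)  # True False
```

Because the tag is keyed with a secret the client never sees, an attacker cannot test candidate solutions offline the way a bare MD5 hash allows, and burning the nonce on every attempt rules out both replaying a solved challenge and guessing twice.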

Computer character recognition

Although visual CAPTCHAs were originally designed to defeat standard OCR software designed for document scanning, a number of research projects have proven that it is possible to defeat many CAPTCHAs with programs that are specifically tuned for a particular type of CAPTCHA. For CAPTCHAs with distorted letters, the approach typically consists of the following steps:

Extraction of the image from the web page.
Removal of background clutter, for example with color filters and detection of thin lines.
Segmentation, i.e. splitting the image into segments containing a single letter.
Identifying the letter for each segment.

Removal of clutter is typically very easy to do automatically. In 2005, it was also shown that neural network algorithms have a lower error rate than humans in glyph identification.^[11] The only part where humans still outperform computers is segmentation.[citation needed] If the background clutter consists of shapes similar to letter shapes, and the letters are connected by this clutter, the segmentation becomes nearly impossible with current software. Hence, an effective CAPTCHA should focus on making segmentation difficult.
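To illustrate why connecting the letters matters, here is a minimal sketch of one simple segmentation technique, splitting on empty pixel columns. It is not the method used by any of the projects cited here; the function name segment_columns and the tiny hand-made "images" are hypothetical.

```python
def segment_columns(rows):
    """Split a binary image into column ranges separated by empty columns.

    `rows` is a list of equal-length strings where '#' marks an ink pixel.
    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(rows[0])
    # A column contains ink if any row has a '#' at that position.
    has_ink = [any(row[x] == "#" for row in rows) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a new run of inked columns begins
        elif not ink and start is not None:
            segments.append((start, x))    # the run ended at an empty column
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# Three separate glyphs, each two columns wide.
clean = [
    "##..##..##",
    "#...#...#.",
    "##..##..##",
]
# The same glyphs with a line drawn through the middle row.
struck = [
    "##..##..##",
    "##########",
    "##..##..##",
]
print(segment_columns(clean))   # [(0, 2), (4, 6), (8, 10)]
print(segment_columns(struck))  # [(0, 10)]
```

A single connecting line leaves no empty column to split on, so the three glyphs collapse into one segment and the per-letter recognizer never receives an isolated letter.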

Neural networks have been used with great success to defeat CAPTCHAs as they are generally indifferent to both affine and non-linear transformations. As they learn by example rather than through explicit coding, with appropriate tools very limited technical knowledge is required to defeat more complex CAPTCHAs.

Some CAPTCHA-defeating projects:

Mori et al. published a paper in IEEE CVPR'03 detailing a method that defeated EZ-Gimpy, one of the most popular CAPTCHAs, 92% of the time.^[12] The same method was also shown to defeat the more complex and less widely deployed Gimpy program 33% of the time. However, it is not known whether implementations of their algorithm are in actual use.

PWNtcha has made significant progress in defeating commonly used CAPTCHAs, which has contributed to a general migration towards more sophisticated CAPTCHAs.^[13]

A number of Microsoft Research papers describe how computer programs and humans cope with varying degrees of distortion.^[14]

Image-recognition CAPTCHAs vs. character-recognition CAPTCHAs

With the demonstration (through research publications) that some character recognition CAPTCHAs are vulnerable to computer vision based attacks, some researchers have proposed alternatives to character recognition, in the form of image recognition CAPTCHAs which require users to identify simple objects in the images presented. The argument is that object recognition is typically considered a more challenging problem than character recognition, due to the limited domain of characters and digits in the alphabets of most natural languages.

An open source image recognition CAPTCHA system is available to users of the popular phpBB2 forum software in the form of an add-on called KittenAuth.^[15] In its default form it requires the user to select a stated type of animal from an array of thumbnail images, but it can be customized, for example to present images of interest to the forum's userbase.

Image recognition CAPTCHAs face many potential problems which have not been fully studied:

It is difficult for a small site to acquire a large dictionary of images which an attacker does not have access to. Without a means of automatically acquiring new labeled images, an image based challenge does not meet the definition of a CAPTCHA.

Some current image recognition CAPTCHAs ask the user to make a binary choice (is this a cat or a dog?). If there are n images to classify, a randomly guessing bot has a 1/2^n chance of classifying all of them correctly. Even with 16 images, a bot still has a 1 in 65,536 chance of passing by guessing, so the CAPTCHA author may need to implement IP-address-based throttling and defenses against botnets. Other schemes exist (for example, asking the user to choose one of a number of possibilities); however, no such system has been created for study.
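The guessing odds above can be checked with a short calculation. This is a sketch that assumes every guess is independent and uniformly random; the function names and the figure of 1,000 attempts are illustrative only.

```python
from fractions import Fraction


def blind_guess_probability(n_choices, options_per_choice=2):
    """Probability that uniform random guessing gets every choice right."""
    return Fraction(1, options_per_choice ** n_choices)


def chance_within(attempts, p_single):
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1 - (1 - p_single) ** attempts


p = blind_guess_probability(16)
print(p)  # 1/65536

# Without throttling, a bot can simply retry; 1,000 tries is an arbitrary example.
print(float(chance_within(1000, p)))
```

A single guess almost never succeeds, but the chance of at least one success grows with every retry, which is why throttling by IP address matters for challenges with a small answer space.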

Collateral benefits

Some of the original inventors of the CAPTCHA system have implemented a means by which some of the effort and time spent by people who are responding to CAPTCHA challenges can be harnessed as a distributed work system. This works by including "solved" and "unrecognized" elements (images which were not successfully recognized via OCR) in each challenge. The respondent thus answers both elements and roughly half of his or her effort validates the challenge while the other half is captured as work.
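The two-element challenge described above might be graded roughly as follows. This is a sketch of the idea only, not the actual reCAPTCHA implementation: the vote store, the function names, the scan identifier and the three-vote consensus threshold are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical store of human transcriptions for words OCR could not read.
votes = defaultdict(Counter)


def grade(control_answer, control_truth, unknown_id, unknown_answer):
    """Grade a two-word challenge using only the word whose answer is known."""
    passed = control_answer.strip().lower() == control_truth.lower()
    if passed:
        # A respondent who read the control word correctly is probably human,
        # so their reading of the unknown word is kept as one vote.
        votes[unknown_id][unknown_answer.strip().lower()] += 1
    return passed


def consensus(unknown_id, min_votes=3):
    """Return the leading transcription once it has `min_votes` votes, else None."""
    if not votes[unknown_id]:
        return None
    word, count = votes[unknown_id].most_common(1)[0]
    return word if count >= min_votes else None


results = [
    grade("morning", "morning", "scan-17", "upon"),
    grade("Morning ", "morning", "scan-17", "Upon"),
    grade("mourning", "morning", "scan-17", "open"),  # fails the control word
    grade("morning", "morning", "scan-17", "upon"),
]
print(results)               # [True, True, False, True]
print(consensus("scan-17"))  # upon
```

The respondent cannot tell which word is the control, so an honest attempt at both is the only reliable way to pass, and answers from respondents who fail the control word are discarded.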

This reCAPTCHA system is being used to aid in the conversion of printed works (scanned images) into digital text. The approach is similar to one of the techniques by which CAPTCHA systems can be circumvented, in that the respondents apply human intelligence to accomplish small amounts of work in a highly distributed way.

The reCAPTCHA maintainers estimate that existing CAPTCHA systems represent approximately 150,000 hours of labor per day that could be transparently tapped into via their revised system. That's approximately 75 years of normal, full-time work accomplished every day.
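The conversion from hours per day to working years can be reproduced as follows, assuming a full-time year of 40 hours a week for 50 weeks; that working-year figure is an assumption, not a number given by the reCAPTCHA maintainers.

```python
hours_per_day = 150_000          # daily labor estimate quoted above
hours_per_work_year = 40 * 50    # assumption: 40-hour weeks, 50 weeks a year

work_years_per_day = hours_per_day / hours_per_work_year
print(work_years_per_day)  # 75.0
```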

References

[1] http://www.cs.sfu.ca/~mori/research/gimpy/
[2] http://captcha.net/
[3] http://www.captcha.net/
[4] The W3C paper Inaccessibility of CAPTCHA outlined some of the accessibility problems with CAPTCHAs.
[5] The article Proposal for an accessible Captcha describes how audio and visual tests can be combined to increase accessibility in a CAPTCHA.
[6] http://www.protectwebform.com/smartcaptcha
[7] "Hire People To Solve CAPTCHA Challenges". Petmail Design. 2005-07-21. Retrieved 2006-08-22.
[8] Doctorow, Cory (2004-01-27). "Solving and creating CAPTCHAs with free porn". Boing Boing. Retrieved 2006-08-22.
[9] "Breaking CAPTCHAs Without Using OCR". Howard Yeend (pureMango.co.uk). 2005. Retrieved 2006-08-22.
[10] "Online services allow MD5 hashes to be cracked". Retrieved 2007-01-04.
[11] Kumar Chellapilla, Kevin Larson, Patrice Simard, Mary Czerwinski (2005). "Computers beat Humans at Single Character Recognition in Reading based Human Interaction Proofs (HIPs)" (PDF). Microsoft Research. Retrieved 2006-08-02.
[12] http://www.cs.berkeley.edu/~mori/gimpy/mori_gimpy.pdf
[13] http://sam.zoy.org/pwntcha/
[14] http://research.microsoft.com/~kumarc/
[15] http://www.thepcspy.com/articles/security/the_cutest_humantest_kittenauth

External links

The CAPTCHA Project
The reCAPTCHA project
Inaccessibility of CAPTCHA: Alternatives to Visual Turing Tests on the Web, a W3C Working Group Note.
CAPTCHA History from PARC.
Google Tech Talk on Human Computation

Defeating CAPTCHAs

Breaking a Visual CAPTCHA (Gimpy) By Greg Mori and Jitendra Malik
Breaking CAPTCHAs without using OCR (talks about a common but easy to fix bug in programming CAPTCHAs, allowing session re-use)
OCR Research Team defeats weak CAPTCHAs.
PWNtcha - CAPTCHA decoder
Defeating a simple CAPTCHA with Open Source software
Will Solve Captcha for Money? - Article on Slashdot about using low-paid data entry workers to defeat CAPTCHAs in bulk.
