One of the problems most often cited with a Captcha-based submission verification system is the lack of accessible options; a plain-text alternative can't be provided, for the simple reason that it gives an easy route to circumvent the Captcha and thus defeat the point of putting such a system in place.
Any alternative means of finding out what the Captcha image shows must be accessible for people who can't view the image, while also presenting a level of difficulty for automated and spam submissions; an option that meets both of these criteria is the audio Captcha.
The concept
The idea behind an audio Captcha is simple: in addition to providing the Captcha image on-screen, a sound file representing the image is made available. This caters for most users who would otherwise be unable to enter the Captcha text. This sound file can be a simple RIFF wave file, but is more often encoded into a speech codec or the ubiquitous MP3 format.
In this article, I'll be looking at the implementation of a simple MP3 audio Captcha, which takes a short string of a few characters and creates a sound file. I'll assume for this article that it's only made up of lowercase letters; there are no digits or uppercase letters, and no punctuation, in order to keep things at a minimal level. The audio Captcha algorithm is based on a series of sound files, each representing one letter, which can then be concatenated into a representation of the whole string.
In the ideal case, it would be simple to take the contents of each file and run them together into one large file, by writing out the contents of the files one after the other. This would be a trivial concatenation process, but will unfortunately not work. For the reason behind that, it's important to look at what makes up a RIFF wave file.
The RIFF file format
A RIFF wave file is more than a basic recording of the digitised waveform; in addition to the waveform data, metadata is attached regarding the size of the data and its origin.
Offset (bytes) | Size (bytes) | Field |
---|---|---|
0 | 4 | Chunk header: "RIFF" |
4 | 4 | RIFF chunk size (file size-8) |
8 | 4 | Chunk header: "WAVE" |
12 | 4 | Subchunk header: "fmt " |
16 | 4 | Format chunk size |
20 | 2 | Format (1=PCM) |
22 | 2 | Channel count |
24 | 4 | Sampling rate (Hz) |
28 | 4 | Bytes per second |
32 | 2 | Block alignment value |
34 | 2 | Bits per sample |
36 | 4 | Subchunk header: "data" |
40 | 4 | Data size |
44 | Data size | File data |
The table above shows the format of the simplest RIFF wave file. The format is capable of holding information about wave files intended for MIDI samplers, cue points for mixing, and various other additions; most wave files will not contain these, and will simply be a record of the waveform data with a header attached.
As can be seen, the wave file specifies not only the length of the digitised waveform, but also its sampling rate and channel count. A telephone-level wave file can easily be distinguished from a CD-quality file by simply checking the sampling rate; in a similar manner, stereo and mono files can be told apart by the channel count. Providing this metadata is the reason for attaching the header, since otherwise a sound player application would have no idea how to play the sound file.
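To make the header layout concrete, here's a minimal sketch of reading those fields in PHP with `unpack`. The function name `parseWavHeader` is my own for illustration, and the code assumes the simplest layout from the table: a 44-byte header with no extra chunks between "fmt " and "data".

```php
<?php
// Sketch: read the fields of a canonical 44-byte PCM wave header.
// Assumes the "data" subchunk directly follows a 16-byte "fmt " subchunk.
function parseWavHeader(string $bytes): array
{
    if (strlen($bytes) < 44) {
        throw new InvalidArgumentException('Too short to hold a wave header');
    }
    // a4 = 4-byte string, V = unsigned 32-bit little-endian,
    // v = unsigned 16-bit little-endian
    $h = unpack(
        'a4riff/VriffSize/a4wave/a4fmt/VfmtSize/vformat/vchannels/' .
        'VsampleRate/VbytesPerSecond/vblockAlign/vbitsPerSample/' .
        'a4data/VdataSize',
        substr($bytes, 0, 44)
    );
    if ($h['riff'] !== 'RIFF' || $h['wave'] !== 'WAVE'
        || $h['fmt'] !== 'fmt ' || $h['data'] !== 'data') {
        throw new InvalidArgumentException('Not a canonical RIFF wave header');
    }
    return $h;
}
```

With this, telling a telephone-level file from a CD-quality one is a matter of comparing `$h['sampleRate']`, and mono from stereo a matter of checking `$h['channels']`.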
Unfortunately, this means that simple concatenation of two RIFF files won't result in a longer RIFF file. A sound player will read the headers at the start of the file, which indicate the length of the first segment to be concatenated, and play that segment; at this point, a reasonable player will deduce that the end of file has been reached, since its record of played samples is the same as the number indicated in the file header, and won't play any more of the file.
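The problem can be shown without any real recordings. The sketch below builds two tiny wave files in memory (the helpers `makeWav` and `declaredDataSize` are hypothetical names, and the files are just silence), glues them together byte for byte, and then reads back the data size that a player would see.

```php
<?php
// Build a minimal mono 16-bit PCM wave file in memory (44-byte header).
function makeWav(string $pcm, int $rate = 8000): string
{
    // format=1 (PCM), 1 channel, rate, bytes/sec, block align, bits/sample
    $fmt = pack('vvVVvv', 1, 1, $rate, $rate * 2, 2, 16);
    return 'RIFF' . pack('V', 36 + strlen($pcm)) . 'WAVE'
         . 'fmt ' . pack('V', strlen($fmt)) . $fmt
         . 'data' . pack('V', strlen($pcm)) . $pcm;
}

// Data size as declared in the header (bytes 40-43, little-endian).
function declaredDataSize(string $wav): int
{
    return unpack('V', substr($wav, 40, 4))[1];
}

$a = makeWav(str_repeat("\x00\x00", 100)); // 200 bytes of samples
$b = makeWav(str_repeat("\x00\x00", 150)); // 300 bytes of samples
$naive = $a . $b;                          // plain byte-level concatenation

echo declaredDataSize($naive), "\n"; // 200: only the first file is declared
echo strlen($naive) - 44, "\n";      // 544: bytes actually following the header
```

The header at the front still describes only the first 200 bytes; the second file's header and samples are just trailing bytes that a player has no reason to touch.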
The solution to this problem is to use a more complex concatenation: instead of simply throwing the files together, they will need to be run through an external sound processor.
External sound processors
The sox command is a simple interface to an audio concatenation and processing tool, which can be used for this audio Captcha. If each letter's wave file is passed into sox, a wave file can be output consisting of all the input files together, with an updated format header containing the total data size and the overall sampling rate. An example invocation would run as follows:
Invocation of sox: An example concatenation
sox a.wav x.wav m.wav b.wav -t .wav axmb.wav
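To illustrate what "an updated format header" means in practice, the following is a rough PHP sketch of the same job for the simplest possible case. It isn't how sox works internally, and it isn't a replacement for it: the hypothetical `concatWavs` function only handles files with the plain 44-byte header and identical format fields, whereas sox copes with extra chunks and differing formats.

```php
<?php
// Sketch only: joins wave files that all share a canonical 44-byte header
// and identical format fields, then writes one header for the total.
function concatWavs(array $wavs): string
{
    $fmt = null;
    $pcm = '';
    foreach ($wavs as $wav) {
        if (strlen($wav) < 44
            || substr($wav, 0, 4) !== 'RIFF'
            || substr($wav, 8, 8) !== 'WAVEfmt '
            || unpack('V', substr($wav, 16, 4))[1] !== 16
            || substr($wav, 36, 4) !== 'data') {
            throw new InvalidArgumentException('Not a canonical wave file');
        }
        // Format, channels, rate, bytes/sec, block align, bits per sample
        $thisFmt = substr($wav, 20, 16);
        if ($fmt !== null && $thisFmt !== $fmt) {
            throw new InvalidArgumentException('Format fields differ');
        }
        $fmt = $thisFmt;
        // Take only as many sample bytes as this file's header declares
        $size = unpack('V', substr($wav, 40, 4))[1];
        $pcm .= substr($wav, 44, $size);
    }
    if ($fmt === null) {
        throw new InvalidArgumentException('No input files');
    }
    // New header: both size fields now describe the combined data
    return 'RIFF' . pack('V', 36 + strlen($pcm)) . 'WAVEfmt '
         . pack('V', 16) . $fmt
         . 'data' . pack('V', strlen($pcm)) . $pcm;
}
```

The two `pack('V', ...)` calls in the return statement are the whole point: the RIFF chunk size and the data size are recalculated for the joined samples, which is exactly what plain concatenation fails to do.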
Since each letter is contained in its own wave file, it's a trivial matter to break up the Captcha text string and build a command line for sox to use. The following example assumes that the Captcha script is written in PHP, and the text is held in the session data after generation.
Building the concatenated wave file
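    // One wave file per letter of the Captcha text
    $parts = array();
    for($i = 0; $i < strlen($_SESSION['captcha']); $i++)
        $parts[] = $_SESSION['captcha'][$i] . '.wav';

    // Concatenate the letter files into a single wave file named after the text
    exec(sprintf('sox %s -t .wav %s.wav', join(' ', $parts), $_SESSION['captcha']));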
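Since the command line ends up in a shell, it's worth being defensive about what goes into it. Here's a hedged sketch of a helper that checks the text really is lowercase letters only, as assumed throughout this article, and quotes every path. The function name `buildSoxCommand` and the directory and output paths are assumptions for illustration; the type is written as `wav`, the form I'm sure sox accepts.

```php
<?php
// Hypothetical helper: turns Captcha text into a sox command line.
// $dir is an assumed directory holding a.wav through z.wav.
function buildSoxCommand(string $text, string $dir, string $outFile): string
{
    // Reject anything but lowercase letters before it gets near a shell
    if (!preg_match('/\A[a-z]+\z/', $text)) {
        throw new InvalidArgumentException('Captcha text must be lowercase letters only');
    }
    $parts = array();
    foreach (str_split($text) as $letter) {
        $parts[] = escapeshellarg($dir . '/' . $letter . '.wav');
    }
    return sprintf('sox %s -t wav %s', implode(' ', $parts), escapeshellarg($outFile));
}

// Example with made-up paths; the result would be handed to exec()
echo buildSoxCommand('axmb', '/var/captcha/letters', '/tmp/axmb.wav'), "\n";
```

Keeping the command construction in a function that returns a string also makes it easy to inspect or log the command before running it.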
What this doesn't do is generate an MP3 representing the Captcha text; for that, an MP3 encoder is required. lame allows for the encoding of MP3s at various sampling rates, but will normally take its sampling information from the input file. Since, as detailed above, a wave file contains detailed information about sampling and formatting, lame is able to use this to generate an MP3 file.
The example below is a slight modification of the sox invocation above, in order to pipe the output to lame and encode an MP3 file, and then to serve the MP3 out as a downloadable file.
Building a Captcha MP3
    $c = $_SESSION['captcha'];

    // One wave file per letter of the Captcha text
    $parts = array();
    for($i = 0; $i < strlen($c); $i++)
        $parts[] = $c[$i] . '.wav';

    // Concatenate with sox, then pipe the wave data into lame for encoding
    exec(sprintf('sox %s -t .wav - | lame - %s.mp3', join(' ', $parts), $c));

    // Serve the encoded MP3 as a downloadable file
    header('Content-Type: audio/mpeg');
    header('Content-Length: '.filesize("{$c}.mp3"));
    header('Content-Disposition: attachment; filename="'.$c.'.mp3"');
    readfile("{$c}.mp3");
An example of this script's usage in a Captcha would be as follows.
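As a minimal sketch, the form field might offer the image and a link to the audio side by side. The function name `captchaField` and the two script paths are hypothetical: `captcha-image.php` stands in for whatever generates the image, and `captcha-audio.php` for the MP3 script above.

```php
<?php
// Sketch: markup for a Captcha field offering both the image and the audio.
// Both URLs are assumptions for illustration.
function captchaField(string $imageUrl, string $audioUrl): string
{
    $img = htmlspecialchars($imageUrl, ENT_QUOTES);
    $aud = htmlspecialchars($audioUrl, ENT_QUOTES);
    return '<img src="' . $img . '" alt="Captcha image">' . "\n"
         . '<a href="' . $aud . '">Listen to the Captcha text (MP3)</a>' . "\n"
         . '<label>Enter the text: <input type="text" name="captcha"></label>';
}

echo captchaField('captcha-image.php', 'captcha-audio.php'), "\n";
```

The link text matters as much as the link: a user who can't see the image needs to be told, in words, that the audio alternative exists.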
Possible enhancements
In the above example, clearly voiced phrases have been used for the constituent letters of the audio Captcha. This provides a good level of accessibility, but compromises the security of the audio Captcha: any automatic circumventions will easily be able to work out the letters that make up the audio file. One solution to this is to overlay a level of noise on the audio file, to provide some level of obfuscation to the output; in addition to this, periods of silence can be inserted between the letter waveforms, making the output less regular.
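Both ideas can be sketched directly on the sample data, before any header is written. The code below is an illustration under my own assumptions rather than a tested obfuscation scheme: each letter is taken to be an array of signed 16-bit samples, the function names are hypothetical, and the noise level and gap lengths are arbitrary starting points that would need tuning by ear.

```php
<?php
// Sketch: join letter waveforms with random-length silences, then overlay noise.
// $letters: array of arrays of signed 16-bit samples, one array per letter.
// $noise: maximum noise amplitude; $minGap/$maxGap: silence length in seconds
// (assumes $minGap <= $maxGap).
function obfuscate(array $letters, int $rate = 8000, int $noise = 500,
                   float $minGap = 0.1, float $maxGap = 0.4): array
{
    $out = array();
    $first = true;
    foreach ($letters as $samples) {
        if (!$first) {
            // A different gap each time makes the letter positions irregular
            $gap = random_int((int)($rate * $minGap), (int)($rate * $maxGap));
            $out = array_merge($out, array_fill(0, $gap, 0));
        }
        $out = array_merge($out, $samples);
        $first = false;
    }
    // Add uniform noise everywhere, gaps included, clamped to the 16-bit range
    foreach ($out as $k => $s) {
        $out[$k] = max(-32768, min(32767, $s + random_int(-$noise, $noise)));
    }
    return $out;
}

// Pack samples as 16-bit little-endian PCM, ready to follow a wave header.
function samplesToPcm(array $samples): string
{
    return pack('v*', ...$samples);
}
```

Adding the noise to the gaps as well as the letters is deliberate: silent stretches that are perfectly zero would make it trivial to find where one letter ends and the next begins.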
Another enhancement that can be made to the audio Captcha output is to provide more formats for the file. At present, the audio Captcha is generated in RIFF wave and MP3 formats; provision for Windows audio and Ogg formats would allow for more widespread usage of the output file.
Imran Nazar <tf@imrannazar.com>, Jan 2010.