‘Defeating Voice Captchas’


Jochem van der Vorm has written a elegant paper and proof of concept to how spammer could attack the voice based captchas being utilized in various web sites.’


‘The information has been provided by Jochem van der Vorm.
The original article can be found at: http://vorm.net/captchas/


For some years semi turing tests under the name of ‘captchas’ can be found on the web, to prevent bots from filling in forms. When Jochem first saw the visual variant he thought recognizing the characters with a computer algorithm should be easy. A bit of surfing and searching on the Internet learned him that he was right, most were broken already. Reinventing the wheel is not very useful, so he left the topic alone.

Later Jochem found a post about voice captchas. Since there was not too much information about this on the net and was bored (ill at home), he decided to give it a shot. Jochem started easy, willing to enhance the used algorithms to those used in speech recognition (like hmm, viterbi, baum-welch, entropy coding, etc.) when needed. This proved not to be necessary, the first feature complete (segmentation and matching) code worked relatively well on Microsoft’s captchas. Later he tweaked it a bit to also work on Google captchas.

On this page you can find proof of concept code to break voice captchas. Do not expect advanced software (pattern recognition science is so much further) or code that can be used in other projects, Jochem quited the project when it worked. Initially (February 2006) he kept the code on his harddisk, but later (may 2006) he published it (see disclosure motivation).

How does it work:
This is not a complete guide, but some pointers to the source (read it luke). As a starting point, consider the configtype struct:

typedef struct {
    int samplerate;
    int byterate;
    int winsize;
    int band_cnt;
    int word_length;
    int word_overlap;
    int threshold_energy;
    int file_offset;
    char trainfile[255];
} configtype;

The program starts with reading the audio file (in the header it could read the samplerate and byterate). file_offset bytes are skipped in the beginning of the file, because Google starts with a bell. The first step is that all samples are treated with a hamming window (arbitrary choice, most window types should do). The winsize is in samples (eg 512 samples on 8000 Hz provides a 64 ms window). Now the blocks are transformed into the frequency domain with a DFT After that the frequencies are put in band_cnt bins. These bins are not equally wide, the higher the frequency, the larger the band (this has to do with human
hearing (mel/bark scale), but Jochem doubts this is actually useful at the current incarnation of the program).

Now the program looks at the highest frequency bin. Every block that has more energy in a window than threshold_energy is considered a peak, and these peaks are used the segment the input file in the different spoken words. The word_length tells the program how many windows long a word is (so all words are considered the same length which is a current weakness of devoicecaptcha). word_overlap helps in localizing the peaks. When the locations of the words are know all frequency bins are written for word_length windows around the peaks. This is called the profile of the word.

The profiles for know words are put in trainfile. When a guess has to be made, the profiles for the words in the file are subtracted from those in the trainfile and the smallest deviation is chosen as the proper word. That is all.

The algoritms in devoicecaptcha are at this moment really naive. There are a lot of possible improvements. Perhaps in the future Jochem will enhance the program a bit, for now he thinks the 33% (as on Google’s captchas) is good enough.

Proof of concept:
The code is rather messy. Download devoicecaptcha.c (see at the bottom) and compile it with it:
gcc -lfftw3 -std=c99 devoicecaptcha.c

As you can see you need fftw, an allround Fourier transform library, which is packaged for most distributions, so you can be lazy (apt-get install fftw3-dev or similar).

When started with ./a.out captcha.wav you also need a data set (a msn one and a Google one are available. If you have downloaded the same captchas (see links) as Jochem has, it will print a guess on stdout.

As said before, devoicecaptcha works with a comparison to a trained set. To build up a training set and test the effectiveness of various parameters you can start devoicecaptcha with a third bogus argument, eg ./a.out captcha.wav –print.

What he did was download a large set of captchas with lwp and transcribed them with the proper words with something like:
for i in google/*.wav; do aplay $i &> /dev/null &; read j; mv $i google/$j.wav; done

He ended up with a directory with filenames like ‘123456.wav’ where 123456 are the digits spoken in the captcha. On this directory he unleashed a small ruby script, which splits the files in a training and testing set, builds a training set and tests the rest. This script can be found under train.rb.

MSN (passport) audio captchas are really weak. Only digits are used, there are always ten digits and the noise is weak and constant. The distance between the words is relatively constant. Devoicecaptcha guesses all ten digits correct on around the 75% of all cases, with a training set of about 40 files.

A data set which can be used for the English MSN (aka passport, aka MSN live) voice captchas (which Jochem got from passport.net) can be downloaded under the name msn.txt. It is also possible to create your own training data (see above).

Google’s voice captchas are more difficult to break than the captchas by Microsoft. Google employs different speakers, uses better noise artifacts and a random number of words. The dictionary is (as Microsoft’s) limited to digits only. The devoicecaptcha program scores around 33% on these voice captchas with a training set of 60 files. This is high enough for use in a bot.

A data set which can be used for the Google captchas (in English, Google also provides captchas in multiple languages) can be found under google.txt. The files were found at Google new account.

These captchas are also in use by blogger and blogspot for comments

If you know other voice captcha systems, let me know. Perhaps Jochem will have some time to look into them (and perhaps not). Jochem will at least add them to the links section on this page, so together with the provided source other people can try to beat them.

Disclosure motivation:
Jochem did not release the source code on this page without hesitation, because it might help spammers in their goals. And he hates spam. However there are some reasons he released the code anyway:
 * Jochem does not believe captchas actually work. Almost all visual captchas are broken already (read for example Microsoft’s paper on visual captchas or ez-gimpy which can defeat the ‘human insolvable’ captchas at yahoo). Or look at the versatile pwntcha. Although he does believe humans can fool computers for quite some time in the future, he suspects that computers can always beat computer generated challenges (without a ‘secret’ as in PKI).

 * Spam is a human problem and he believes human problems should be solved by humans. Legislation and law enforcement are in his opinion the best ways to deal with spammers. If captchas did no harm anyone this would of course not be enough reason for releasing this code, but

 * Captchas make the web a more difficult place for disabled people (see http://www.w3.org/ and more annoying for everyone. Jochem hopes the community will be motivated to find other solutions (and he is happy that the w3 organization cares for people with disabilities and a usable web for everyone, including the deaf-blind).

 * Jochem is a proponent of full disclosure.

Some people might ask what kind of solutions Jochem suggests for solving the spam problem. Spamassasin catches thousands of spam mails for me; it is expensive in cpu cycles (so putting spammers in jail is preferred), but the multi-tiered approach (neural network detection together with several lists of wrong-doers) works relatively well and can be applied to other forms of spam.

Playing the cat/mouse game with more difficult captchas, when the previous challenge is broken will work, but is not satisfactory in the end. Jochem encountered more human unsolvable captchas everyday. Jochem understands that corporations play this game however; in the real world thresholds do help.

 * Wikipedias entry on captchas is a very good starting point on this topic (as usual) and contains links to pages about breaking visual captchas

 * Inaccessibility of CAPTCHA by w3.org

 * Microsofts paper on visual captchas

 * PWNtcha – captcha decoder by sam hocevar

Broken voice captchas
 * Google signup

 * Microsofts passport service

 * blogger and blogspot also use google voice captchas, so these are broken to.

Not yet broken voice captchas
 * Killbot with audio (should be easy with standard voice recognition software)

 * Standards schmmandards proposal is adding random text like (‘You now hear numbers ..’) and using a speech engine with different voices. Jochem intuition tells him it is easy to beat, but the C# source is unusable on his Linux box, so he cannot prove. Generated examples are welcome.

Source code:
/* devoicecaptcha.c, jochem, http://vorm.net/captchas */

#define PI 3.141592653
#define EXP 2.718281828


typedef struct {
    int samplerate;
    int byterate;
    int winsize;
    int band_cnt;
    int word_length;
    int word_overlap;
    int threshold_energy;
    int file_offset;
    char trainfile[255];
} configtype;

static configtype msn = { 8000, 1, 256, 6, 10, 8, 8000, 46, ‘msn.txt’};
static configtype google = { 8000, 2, 512, 12, 8, 14, 2000000, 25500, ‘google.txt’};

/* Hamming window */
void hamming(double* data, int winsz)
    for(int i=0; i < winsz; ++i)
    *(data+i) = (0.54-0.46*cos(2.0*PI*i/winsz))*(*(data+i));

/* Precalc ‘semi-mel’ band boundaries in samples */
void setup_mel(int* bands, int cnt, int ws, int fs)
    double max_mel = 1127.01048*log(1.0+(fs/700.0));
    for(int i=1; i <= cnt; ++i)
    *(bands+i-1) = (int) (700*ws*(pow(EXP,((i*max_mel)/cnt)/1127.01)-1))/fs;

/* Sum over band */
void sum_over_bands(double* data, int* bands, double *sum, int cnt)
    int j = 1;
    for(int i=0; i < cnt; i++) {
    *(sum+i) = 0;
    while(j < *(bands+i)) {
        *(sum+i) += (double) abs(*(data+j));
    /* Normalize */
    if (i == 0) *(sum+i) = *(sum+i)/(*(bands+i));
    else *(sum+i) = *(sum+i)/(*(bands+i)/(*(bands+i-1)));

/* Simple compare for qsort */
int compare (const void * a, const void * b) { return ( *(int*)a – *(int*)b ); }

/* Naive peak detection, by comparing energy with threshold, max first */
/* using only info from band_idx */
int detect_peaks(double* values, int values_cnt, int** peaks, int width,
                 int threshold, int band_idx, int band_cnt)
    int detected = 0;
    int max = 0;
    int max_loc;
    /* Copy only band_idx data to prevent distorting values */
    double* band = (double*) malloc(values_cnt*sizeof(double));
    if (band == NULL) {
        perror(‘Out of memn’);
    for(int i=0; i < values_cnt; i++)
        *(band+i) = *(values+i*band_cnt+band_idx);
    while(1) {
        for (int i=0; i < values_cnt; i++) {
            if (*(band+i) > max) {
                max = *(band+i);
                max_loc =i;
        if (max < threshold) break;
        max = 0;
        *peaks = realloc(*peaks, sizeof(int)*detected);
        if (*peaks == NULL) {
            perror(‘Out of memn’);
        *(*peaks+detected-1) = max_loc;
        for(int j=max_loc-width/2; j < max_loc+width/2; j++)
            if (j >= 0 && j < values_cnt)
                *(band+j) = 0;
    qsort(*peaks, detected, sizeof(int), compare);
    return detected;

/* Basic usage information */
void show_usage(char *s)
    printf(‘Start with %s name.wavn’
    ‘to guess the words in name.wav.nn’
    ‘Start with %s name.wav –printn’
    ‘to print a profile for name.wav thatcan be used as ‘
    ‘training datan’, s,s);

/* Lazy open wav: – no header parsing */
void sniff_wavfile(char *s, configtype **cfg)
    FILE *fp = fopen(s, ‘rb’);
    if (fp == NULL) {
        perror(‘Could not open input filen’);
    unsigned char buf[50];
    if (! fread(buf, sizeof(buf), 1, fp)) {
        perror(‘File to smalln’);
    if (buf[34] == 8)
        *cfg = &msn;
    else if (buf[34] == 16)
        *cfg = &google;
    else {
        perror(‘Unknown input filen’);

/* Read file, window, dft and calc params */
int read_blocks(char* in, double** params, configtype *cfg)
    int samples_read=0;
    int blocks_read=0;

    FILE *fp = fopen(in, ‘rb’);
    if (fp == NULL) {
        perror(‘Could not open input filen’);
    fseek(fp, cfg->file_offset, SEEK_SET);
    /* Setup fftw and mel */
    double* audio_t = (double*) malloc(cfg->winsize*sizeof(double));
    double* audio_f = (double*) malloc(cfg->winsize*sizeof(double));
    int *bands = (int*) malloc(cfg->band_cnt*sizeof(int));
    if (audio_t == NULL || audio_f == NULL || bands == NULL) {
        perror(‘Out of memn’);
    fftw_plan p = fftw_plan_r2r_1d(cfg->winsize, audio_t, audio_f, FFTW_R2HC,

    /* Loop over file */
    short buf;
    while(! feof(fp)) {
        if (cfg->byterate == 1) {
            buf = fgetc(fp);
            *(audio_t+(samples_read++ % cfg->winsize)) = (double) buf;
        } else if (cfg->byterate == 2) {
            if (fread(&buf,1, sizeof(short), fp) < sizeof(short))
            *(audio_t+(samples_read++ % cfg->winsize)) = (double) buf;
        if((samples_read % cfg->winsize) == 0) {
            *params = (double*) realloc(*params, (blocks_read+1)*
            if (*params == NULL) {
                perror(‘Out of memn’);
            sum_over_bands(audio_f, bands, (*(params)+blocks_read*
                           cfg->band_cnt), cfg->band_cnt);
    } /* todo (perhaps ;-)) zeropad and handle last block */

    /* Free/close */
    return blocks_read;

/* Load trained set (messy) */
int load_trained(char* filename, int** profiles, int** words, int n_bands)
    int digit;
    int n_digits = 0;
    char buf[255];
    int lines = 0;
    FILE* trained = fopen(filename, ‘r’);
    if (trained == NULL) {
        perror(‘Could not open trained setn’);
    } else {
        while(fgets(buf, 255, trained) != NULL) {
            if (strstr(buf, ‘:’) != NULL) {
                if (sscanf(buf, ‘%d’,&digit)) {
                    *words = (int*) realloc(*words,n_digits*sizeof(int));
                    if (*words == NULL) {
                        perror(‘Out of memn’);
                    *(*words+n_digits-1) = digit;
            } else {
                *profiles = (int*) realloc(*profiles,lines*n_bands*sizeof(int));
                if (*profiles == NULL) {
                    perror(‘Out of memn’);
                char* t = strtok(buf, ‘ ‘);
                int cnt = 0;
                while(t!= NULL) {
                    if (++cnt > n_bands) break;
                    *(*profiles+cnt+(lines-1)*n_bands-1) = atoi(t);
                    t = strtok(NULL, ‘ ‘);
                if (cnt < n_bands+1) {
                    perror(‘Not enough columns in inputn’);
    return n_digits;

/* Get params around peaks only */
void separate_word_data(int* p, double* all, int* peaks, int n, int word_cnt,
                        configtype* cfg)
    int *p_ptr = p;
    for(int i=0; i < word_cnt; i++) {
        int loc = peaks[i];
        for(int i=loc-cfg->word_length/2; i < loc+cfg->word_length/2; i++) {
            if (i > 0 && i < n) {
                for(int j=0; j < cfg->band_cnt; j++) {
                    *(p_ptr) = (int) *(all+i*cfg->band_cnt+j);
            } else if(i < 0) {
                /* fixme: this is hack for digits starting too early*/
                for(int j=0; j < cfg->band_cnt; j++) {
                    *(p_ptr) = (int) *(all+j);
            } else {
                /* or late */
                for(int j=0; j < cfg->band_cnt; j++) {
                    *(p_ptr) = (int) *(all+j+n-1);

/* Print param profiles for seperated digits */
void print_params(int* params, int maxx, int maxy, int maxz)
    for(int i=0; i < maxx; i++) {
        for(int j=0; j < maxy; j++) {
            for(int k=0; k < maxz; k++) {
                printf(‘%d ‘, *(params+i*maxy*maxz+j*maxz+k));

int closed_match(int* word_data, int word_idx, int* trained_data,
                  int trained_sz, configtype *cfg)
    int sum;
    int pos;
    int it_lowest=999999;

    int* t_band = malloc(sizeof(int)*cfg->band_cnt);
    int* t_mean = malloc(sizeof(int)*cfg->band_cnt);
    if (t_mean == NULL || t_band ==NULL) {
        perror(‘Out of memn’);
    for (int i=0; i < trained_sz; i++) {
        memset(t_band, 0, sizeof(int)*cfg->band_cnt);
        memset(t_mean, 0, sizeof(int)*cfg->band_cnt);
        for (int j=0; j < cfg->word_length; j++) {
            for (int k=0; k < cfg->band_cnt; k++) {
                int orig = *(trained_data+i*cfg->word_length*cfg->band_cnt+
                int test = *(word_data+j*cfg->band_cnt+k+word_idx*
                *(t_band+k) += abs(orig-test);
                *(t_mean+k) += orig;
        sum = 0;
        int borked = 0;
        for (int l=0; l < cfg->band_cnt; l++) {
            if (*(t_mean+l) != 0)
                sum += (*(t_band+l)*cfg->word_length)/(*(t_mean+l));
                borked = 1;
        if (sum < it_lowest && ! borked) {
            it_lowest = sum;
            pos = i;
    return pos;

int main(int argc, char** argv)
    if (argc != 2 && argc !=3)

    /* Sniff filetype and set config for that type */
    configtype *cfg;

    /* Read file and give parames per winsize blocks */
    double* params = malloc(0);
    int blocks_read = read_blocks(argv[1], &params, cfg);

    /* Calculate positions of separate words */
    int* peak_locations = (int*) malloc(0);
    int word_cnt = detect_peaks(params, blocks_read, &peak_locations, cfg->word_length+
                        cfg->word_overlap, cfg->threshold_energy,cfg->band_cnt-1,cfg->band_cnt);

    /* Only interested in data near peaks, put in words */
    int* words = (int*) malloc(word_cnt*cfg->word_length*cfg->band_cnt*
    if (words == NULL) {
        perror(‘Out of memn’);
    separate_word_data(words, params, peak_locations, blocks_read, word_cnt, cfg);

    /* Print profile for training or lookup in trained set */
    if (argc == 2) {
        printf(‘Use training data from: %sn’, cfg->trainfile);
        printf(‘Detected %d wordsn’, word_cnt);
        printf(‘Guess is: ‘);

        /* Load trained datafile */
        int* trained_params = (int*) malloc(0);
        int* trained_ids = (int*) malloc(0);
        int t_size = load_trained(cfg->trainfile, &trained_params, &trained_ids,

        /* Now compare params with trained data per word*/
        for (int m=0; m < word_cnt; m++) {
            int match_idx = closed_match(words, m, trained_params, t_size, cfg);
            printf(‘%d’, *(trained_ids+match_idx));

        /* Freeing */
    } else if (argc == 3) {
        print_params(words, word_cnt, cfg->word_length, cfg->band_cnt);
    } else show_usage(argv[0]);

    /* Free and close stuff */

Categories: Reviews