captcha scraping help

gutterseo

Feb 27, 2009
Howdy folks, here I am with yet another question that is really busting my balls. Worst thing is, I think I tackled the same thing about two months ago and just can't remember what I did. I know this might not be the best forum to post this on, but the other one is down and I know some of you guys are great at PHP.

A quick overview of the situation: I am using PHP/cURL to try to automate a web form.
The captcha URL is always static, i.e. http://site.com/act/captcha.jpg
There is a token on the register page that also needs to be scraped.

However, my problem lies herein. I scrape the register page and parse the token number. I then scrape the captcha URL, but alas all my efforts are for naught, as this image is not the same as the one originally displayed on the register page. I think I verified this: when I right-click and view image on the original captcha, the image also changes when displayed.

What I think is happening is that when I request the captcha after already scraping the register page, it alters the cookie and gives me a new image. Here is a quick sample of the function I am using; the scrape_page function is shown below it:

Code:
function get_token($site)
{
    // start from a clean session: delete any leftover cookies
    if (file_exists("cookies/cookies.tmp")) {
        unlink("cookies/cookies.tmp");
    }

    // fetch the register page and pull the hidden token out of it
    $register_page = scrape_page($site, "https://site.com/act/register");
    if (!preg_match("/name=\"token\" value=\"(.*?)\" \/>/", $register_page, $matches)) {
        return false; // token not found on the page
    }
    $token = $matches[1];

    // fetch the captcha once, in the same cookie session, and save it to disk
    $captcha_url = "https://site.com/act/Captcha.jpg";
    file_put_contents("captcha/captcha.jpg", scrape_page($captcha_url, "https://site.com/act/register"));

    return $token;
}
Code:
function scrape_page($page, $referer)
{
    // cookie path, shared by every request so the session carries over
    $file_cookie = "cookies/cookies.tmp";
    $ch = curl_init($page);

    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $file_cookie);  // cookies are written here on close
    curl_setopt($ch, CURLOPT_COOKIEFILE, $file_cookie); // cookies are read from here
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    // certificate checks are switched off; fine for testing, unsafe otherwise
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)");

    $response = curl_exec($ch);
    // curl_error() has to be read before curl_close(), not after
    //if ($response === false) echo curl_error($ch);
    curl_close($ch);

    return $response;
}
Major kudos to anyone who can help me solve this.
 


...
I think I verified this as when I right click and view image on the original captcha it also changes the image when displayed.
....

Well, obviously so, since the captcha URL is designed (in most cases) to generate a new number/code on each request, and typically that code gets stored as a session value. Ergo the code retrieved is only good as of that request and for as long as the session lasts. You would have to actually save the captcha image and not make another request to it during the session, in order to keep the stored code the same as the one in the image.

And if you're comparing the server against your local machine, there's a good chance you went over the session period.

PS: Captcha values are rarely stored in a client-side cookie, but rather in a server-side session. (Even if it was stored in the cookie, I noticed you pulled the captcha URL after looking for a token in the cookies.)
 
PS #2: In case you didn't know, the URL http://site.com/act/captcha.jpg can appear static while actually being served by a server-side script (PHP, for example) behind a mod_rewrite rule. It looks like static content, but it is a dynamic script sending back an image header and data.

Code:
/* Bunch of stuff here to generate and draw the image in memory */
header("Content-Type: image/jpeg"); 
ImageJpeg($image);

essentially.
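
To flesh that out a bit, here is a rough sketch of what such a script might look like. It is not the target site's actual code: the function names, the `captcha_code` session key, the image size and the alphabet are all made up for illustration, and it assumes PHP 7+ with the GD extension. The point is only that each request overwrites the code stored in the session.

```php
<?php
// Hypothetical captcha endpoint; every name here is illustrative only.

// Build a random code of the given length from an unambiguous alphabet.
function make_captcha_code(int $length = 5): string
{
    $alphabet = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';
    $code = '';
    for ($i = 0; $i < $length; $i++) {
        $code .= $alphabet[random_int(0, strlen($alphabet) - 1)];
    }
    return $code;
}

// Every call overwrites the code stored for this session, which is
// why fetching the image a second time invalidates the first image.
function serve_captcha(): void
{
    session_start();
    $code = make_captcha_code();
    $_SESSION['captcha_code'] = $code;

    // Draw the code as plain black text on a white background.
    $image = imagecreatetruecolor(120, 40);
    $bg = imagecolorallocate($image, 255, 255, 255);
    $fg = imagecolorallocate($image, 0, 0, 0);
    imagefilledrectangle($image, 0, 0, 119, 39, $bg);
    imagestring($image, 5, 30, 12, $code, $fg);

    header('Content-Type: image/jpeg');
    imagejpeg($image);
}
```

The script sitting behind the rewritten captcha.jpg URL would just call `serve_captcha()`, and the form handler would compare the submitted value against `$_SESSION['captcha_code']`.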
 
^^ this.

You cannot request the same captcha image twice and get the same result.
 
Thanks for the replies guys, still looking for a workaround. Would a multi curl to scrape the two pages (reg page and captcha image) simultaneously work to my advantage, or is there something I'm missing?

@kblessinggr the token value is a hidden value on the register page. I looked and it's not in the client-side cookies.

I also tried a workaround sending the same token with a known solved captcha and no joy.
 
Actually, when cURL scrapes the registration page it doesn't send another HTTP request to get the image (your browser would, however, to display it inline), so it isn't invalidating your request by regenerating it. As far as the website knows, you're just requesting the image normally (as if you were actually a browser).

The exact problem with your example I'm not sure about, but I would try playing around with only setting CURLOPT_COOKIEFILE when you're scraping the captcha.
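
In case it helps, a minimal sketch of that idea: build the options for the captcha request with CURLOPT_COOKIEFILE but no CURLOPT_COOKIEJAR, so the existing session cookies get sent but nothing is written back to the file. The helper names `readonly_cookie_options()` and `fetch_readonly()` are hypothetical, and whether this changes anything for this particular site is only a guess.

```php
<?php
// Hypothetical helper: cURL options for a request that sends the
// existing session cookies but never rewrites the cookie file.
function readonly_cookie_options(string $cookieFile, string $referer): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here
        // CURLOPT_COOKIEJAR is deliberately left out: nothing is saved back
        CURLOPT_REFERER        => $referer,
    ];
}

// Hypothetical wrapper that performs one GET with those options.
// Returns the response body, or false if the request failed.
function fetch_readonly(string $url, string $cookieFile, string $referer)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, readonly_cookie_options($cookieFile, $referer));
    return curl_exec($ch);
}
```

Usage would be something like `fetch_readonly($captcha_url, "cookies/cookies.tmp", "https://site.com/act/register")`, called once after the register page has been scraped.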
 
Problem solved. In the end it had nothing to do with cookies; it was solely down to me writing bad code and requesting the captcha image twice. DOH. That is why it wasn't matching up with the token.

Thanks for the input anyway.

/facepalm
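
For anyone landing here with the same problem, one way to guard against the double request is to route the captcha fetch through a small wrapper that only ever hits a given URL once per script run and hands back the saved bytes after that. This is only a sketch under assumptions: `fetch_once()` is a made-up name, and `$fetch` stands in for whatever does the real request (for example the thread's scrape_page() with the referer filled in).

```php
<?php
// Hypothetical wrapper: fetch a URL at most once per script run and
// reuse the stored body on every later call for the same URL.
// $fetch is any callable that takes a URL and returns the response body.
function fetch_once(callable $fetch, string $url)
{
    static $cache = [];
    if (!array_key_exists($url, $cache)) {
        $cache[$url] = $fetch($url); // the only real request for this URL
    }
    return $cache[$url];
}

// Example wiring (commented out because it needs scrape_page() and a live site):
// $get   = function ($u) { return scrape_page($u, "https://site.com/act/register"); };
// $bytes = fetch_once($get, "https://site.com/act/Captcha.jpg");
// file_put_contents("captcha/captcha.jpg", $bytes);
```

The cache lives only for the current script run, so it will not help if the captcha is requested again from a separate run against the same session.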