curl, php & proxies are killing me...

enderz

New member
Jan 13, 2009
Well... I have a general problem, but I'll give a specific example and maybe some of the pros here can help me out...

Let's say I have a list of 1000 proxies. I wrote a simple PHP script that curls some page, and I check whether I got the page back (I check against a site that blocks multiple attempts, to make sure I'm not using a transparent proxy).

Anyway, it works fine... but randomly hangs! Meaning the script just stops working/freezes. It can happen after 40 proxies checked or after 200. What I usually do is delete the proxies that were already checked and hit F5.

Now... this is a general problem, because it happens again afterwards when I use the proxies that passed the check to curl other places: it curls for a while and then hangs!!! WTF???

For the tech freaks here, this is how I do it. All the curling lives in one class, and the options are set like this:

Code:
$proxy = $this->getProxy(); // returns a random proxy from the list
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);

curl_setopt($ch, CURLOPT_PROXY, $proxy);

curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);
curl_setopt($ch, CURLOPT_DNS_CACHE_TIMEOUT, $timeout);

curl_setopt($ch, CURLOPT_REFERER, $referer);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
curl_setopt($ch, CURLOPT_FORBID_REUSE, true);

curl_setopt($ch, CURLOPT_FAILONERROR, false);
curl_setopt($ch, CURLOPT_DNS_USE_GLOBAL_CACHE, false);

$result['EXE'] = curl_exec($ch); // here it hangs!!!
$result['ERR'] = curl_error($ch);
$result['INF'] = curl_getinfo($ch);

curl_close($ch);

Any Ideas???
 


I assume you're using a regex to check whether you got the page, yeah? Have you tried running the script from the command line instead of in a browser? I'm no curl guru, just throwing out suggestions.

More than likely, if they're scraped proxies, they're just shitty. They may work one minute and die the next.
 
Anyway, it works fine... but randomly hangs! Meaning the script just stops working/freezes. It can happen after 40 proxies checked or after 200. What I usually do is delete the proxies that were already checked and hit F5.

What's your timeout set at?

Try running a loop using the same proxy over and over, you may just be hitting a dead proxy and not have any code to handle it.
 
Your code should be written so that getProxy() returns live proxies only.
And test just that method to make sure it handles dead proxies.
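For example, a quick re-check with short timeouts before handing a proxy out could look something like this (a standalone sketch, not the OP's class; getLiveProxy(), the probe URL and the retry count are all made up):

Code:
// $proxies is an array of "ip:port" strings.
function getLiveProxy(array $proxies, $probeUrl = 'http://www.example.com/')
{
    for ($tries = 0; $tries < 5; $tries++) {
        $proxy = $proxies[array_rand($proxies)];  // pick a random candidate

        $ch = curl_init($probeUrl);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_NOBODY, true);   // we only care whether it answers at all
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($ch, CURLOPT_TIMEOUT, 5);
        $ok = curl_exec($ch) !== false;
        curl_close($ch);

        if ($ok) {
            return $proxy;                        // looks alive right now
        }
    }
    return false;                                 // nothing usable after a few tries
}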
 
ok.

All the proxies entered into my DB are checked and alive. BUT... during the check itself the script sometimes hangs, and of course I can't tell in advance whether a proxy is already dead, since that's exactly what I'm checking...

After the check, when I'm using the proxies that passed, the script hangs again. Maybe because a certain proxy went dead... how can I tell? How can I check it without letting the script freeze?

getProxy() theoretically should return a random live proxy, but again, if it went dead I can't tell without checking, and again... the script sometimes hangs, maybe when it checks dead proxies (and all of this is theoretical, maybe it hangs on proxies that aren't dead and there is some other explanation???).

The timeout varies depending on the function using it, I set 4-10 seconds, but I'm not sure this is the problem, because when the script returns a timeout that's fine by me. The problem is when it returns nothing and just hangs.

So what to do???
 
ok.

the problem is when it returns nothing and just hangs.

So what to do???

Add some code to check $result for empties and pass it back only if there's something to pass. Otherwise, continue to the next iteration of your loop (I'm working on the assumption that this is inside a loop and you're actually doing something with the contents of $result).

Code:
if (!empty($result['EXE'])) {
    return $result;
} else {
    echo "getpage failed";
}
 
If you're having problems with cURL, it's not that hard to just write your own basic version.
Open a normal socket and connect to the proxy's IP and port. All an HTTP request through a proxy looks like is
"GET http://".$server."/".$page." HTTP/1.1\r\nHost: ".$server."\r\n\r\n"
where $server is the domain you're trying to load.
The only difference between it and a standard HTTP request is that the first line of the GET request contains the full URL, including the domain name.
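A minimal sketch of that idea in PHP, assuming a plain HTTP proxy at $proxyIp:$proxyPort (all the names here are illustrative, and the timeouts keep a dead proxy from hanging the script):

Code:
function rawProxyGet($proxyIp, $proxyPort, $server, $page, $timeout = 10)
{
    $fp = @stream_socket_client("tcp://$proxyIp:$proxyPort", $errno, $errstr, $timeout);
    if (!$fp) {
        return false;                             // could not even connect to the proxy
    }
    stream_set_timeout($fp, $timeout);            // cap every read as well

    // proxy-style request: absolute URL on the request line
    $req = "GET http://".$server."/".$page." HTTP/1.1\r\n"
         . "Host: ".$server."\r\n"
         . "Connection: close\r\n\r\n";
    fwrite($fp, $req);

    $response = '';
    while (!feof($fp)) {
        $chunk = fread($fp, 8192);
        if ($chunk === false || $chunk === '') {
            break;                                // read timed out or the connection dropped
        }
        $response .= $chunk;
    }
    fclose($fp);
    return $response;                             // headers + body, parse as needed
}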
 
Wow... thanks guys for the help.

Some answers:
1. I'm running it in a browser.
2. Scubaslick, the script hangs on curl_exec($ch), so checking the result afterwards does no good.
3. cmxp... hmmm... good idea, although I'm really pissed off about this frustrating issue. I'd love to solve it rather than work around it, especially since it doesn't seem to be a general problem.

In Java I would create a separate thread, and if it didn't return after x seconds I would kill it and retry. Is there something similar for PHP?
 
In Java I would create a separate thread, and if it didn't return after x seconds I would kill it and retry. Is there something similar for PHP?
Why not run a separate Java program that opens a port and listens for ip:port, then returns a response?
You could just connect to localhost and let Java do the heavy lifting. I do something similar for captchas.

In all honesty this would give you much better multithreading than PHP. curl_multi is where memory goes to die.
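If you want to stay in PHP, one way to approximate the Java "thread plus kill" pattern on the CLI is to fork the blocking call into a child process and kill the child if it overstays. This is only a rough sketch, it assumes the pcntl and posix extensions on a *nix box, and it reports pass/fail rather than returning the page content:

Code:
function fetchWithHardLimit($url, $proxy, $hardLimit = 15)
{
    $pid = pcntl_fork();
    if ($pid === -1) {
        return false;                             // fork failed
    }

    if ($pid === 0) {                             // child: do the blocking request
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10);
        $ok = curl_exec($ch) !== false;
        curl_close($ch);
        exit($ok ? 0 : 1);                        // exit code tells the parent how it went
    }

    // parent: poll for the child, kill it if it takes too long
    $start = time();
    while (time() - $start < $hardLimit) {
        if (pcntl_waitpid($pid, $status, WNOHANG) === $pid) {
            return pcntl_wexitstatus($status) === 0;
        }
        usleep(200000);                           // check every 0.2s
    }
    posix_kill($pid, SIGKILL);                    // hung: kill the child outright
    pcntl_waitpid($pid, $status);                 // reap it
    return false;
}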
 
xmcp... I guess you're right. My problem is laziness, I haven't coded in Java for a while and hate having to set up a dev environment for it etc., but if I can't solve it in PHP, that's what I'll do :(

Onejust - it hangs forever...
 
hacked up my own solution for this earlier today, not php/curl but it works if you're lucky enough to be on a *nix based OS ;) (assuming the proxies are in ip:port format, one per line, in proxies.txt)

Code:
rm -f workingproxies.txt
rm -f checker.sh
# build a one-ping-per-proxy script (swap cat for head if you only want to test the first 10 lines)
cat proxies.txt | sed -e 's/:/ /g' | awk '{print $1}' | sed -e 's/^/ping -c1 -s1 /g' | sed -e 's/$/ | grep "bytes from" >> workingproxies.txt/g' >> checker.sh
chmod +x checker.sh
./checker.sh
# pull the responding IPs back out of the ping output (note: the ports are dropped)
cat workingproxies.txt | sed -e 's/:/ /g' | awk '{print $4}' > temp.txt
mv temp.txt workingproxies.txt
if your list is pretty big, use tail -f workingproxies.txt to get the working ones as they come out
 
Seems as if curl is dying on a particular proxy that's dead. You need some code to abort the mission if it takes too long; perhaps there is another timeout-type variable you aren't using: connect timeout, download timeout, etc.?

OR it could be a curl or PHP bug, so try upgrading curl/PHP.

Also, I see you have curl_setopt($ch, CURLOPT_FAILONERROR, false);

Set it to true? As long as you can still detect whether a proxy is good or not, who cares if curl cries on an error, just handle it.
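For reference, there are a few more timeout-related options worth trying, depending on the libcurl/PHP version you're on (the values below are just examples):

Code:
curl_setopt($ch, CURLOPT_NOSIGNAL, 1);            // helps timeouts behave in some SAPI/threaded setups
curl_setopt($ch, CURLOPT_TIMEOUT_MS, 8000);       // overall cap in milliseconds (newer curl/PHP only)
curl_setopt($ch, CURLOPT_LOW_SPEED_LIMIT, 10);    // abort if slower than 10 bytes/sec...
curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 15);     // ...for 15 seconds straight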
 
Thx...

I will turn that parameter back to true; it was true, but during my trial and error I set it to false.

I think I have put in all the timeout vars that exist; if someone thinks I haven't, I'd love to know which one...

And I don't think it is a dead proxy, since it happens randomly and not on a particular proxy.
 
My workaround:

Although I got some really good advice here, I have decided to stick with PHP and curl. So for those of you who scrape by the millions and have this hang problem, here is what I did to solve it:
1. A PHP page that does the scraping and gets all the data it needs from GET/POST - this is the "shit" that hangs.
2. A generic PHP class you call from wherever you need it; it curls #1.

The nice thing is that when #1 hangs, #2 gets a timeout error, which is great.
That's it, problem solved :)
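A rough sketch of how the two pieces might fit together (the local scrape.php URL and its parameters are made-up placeholders, not the actual setup):

Code:
// #2: the generic class/function - it curls #1 with its own hard timeout,
// so if the inner curl hangs, this call just times out and we move on.
function scrapeViaWrapper($targetUrl, $proxy, $outerTimeout = 30)
{
    $inner = 'http://localhost/scrape.php'        // #1: the page doing the real curl through the proxy
           . '?url=' . urlencode($targetUrl)
           . '&proxy=' . urlencode($proxy);

    $ch = curl_init($inner);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_TIMEOUT, $outerTimeout); // this is what saves us when #1 hangs
    $page = curl_exec($ch);
    curl_close($ch);

    return $page;                                 // false => #1 hung or failed, try another proxy
}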