cURL Question

nvanprooyen

Fortes Fortuna Adiuvat
Dec 8, 2008
This is probably easy, but say I needed to scrape something on a page that I can't link to directly because it requires a session ID (e.g. a shopping cart). I can, however, browse to the product detail page with the add to cart button on it. If I go to this page, is it possible to "click" on this link so a session is created and I can grab data off the add to cart page? Sorry if this is a noob question. I looked around for a little while and couldn't find anything on it. I probably just don't know what to search for.
 


Yes, it's possible. Disable cookies in your browser, go to the main page, then see if the session id is passed in the url of the links on the page. If not, you'll have to enable cookies within curl with CURLOPT_COOKIEFILE and CURLOPT_COOKIEJAR. If you're confused, post up the URL, or PM it to me and I'll show you how.
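
If the session id does turn out to be passed in the URL, something like this rough sketch could pull it out of a link so you can reuse it on later requests. The parameter name `PHPSESSID` and the example URL are assumptions; the site you're scraping may call it something else.

```php
<?php
// Return the session id carried in a URL's query string, or null if it is absent.
// $param is an assumed name; check the real links to see what the site uses.
function sessionIdFromUrl(string $url, string $param = 'PHPSESSID'): ?string
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query)) {
        return null; // no query string at all
    }
    parse_str($query, $params);
    if (isset($params[$param]) && is_string($params[$param])) {
        return $params[$param];
    }
    return null;
}

// Placeholder link, as it might appear on the product page.
$sid = sessionIdFromUrl('http://www.domain.com/cart.php?item=42&PHPSESSID=abc123');
```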
 
I already have cookies enabled like this:

//cookies
curl_setopt($ch, CURLOPT_COOKIESESSION, TRUE);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie);

does that look right?
 
My main question is how do I go from page 1 to page 2. I've never used it to "crawl" anything or navigate around. I'm sure I'm getting the terminology wrong, but hopefully you know what I'm saying. Everything I've done to date is just feeding in a page, taking what I need and then moving on.
 
I've never had to use CURLOPT_COOKIESESSION before, but yeah... that looks right as long as $cookie is a set variable that contains a path to a file that is writable by apache (or whatever your webserver is).
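
If you want to rule the cookie file out as the problem, a quick check along these lines might help. `cookieFileIsUsable` is a made-up helper name and the path is only a placeholder.

```php
<?php
// Hypothetical helper: true if curl should be able to write its cookie jar to $path.
function cookieFileIsUsable(string $path): bool
{
    if (file_exists($path)) {
        return is_file($path) && is_writable($path);
    }
    // File not there yet: curl has to be able to create it in the parent directory.
    return is_dir(dirname($path)) && is_writable(dirname($path));
}

$cookie = sys_get_temp_dir() . '/myCookie.txt'; // assumed location, use your own
if (!cookieFileIsUsable($cookie)) {
    echo "Cookie file $cookie is not writable by this user\n";
}
```

Run it as the same user the webserver runs as, otherwise the result won't tell you much.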
 
Yep, it is. Writable cookie folder outputting to myCookie.txt.
 
Just do two curl requests.

Code:
<?php

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.domain.com/");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
$firstpage = curl_exec($ch);
curl_close($ch);

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.domain.com/cart.php");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, '/tmp/cookies.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/tmp/cookies.txt');
$secondpage = curl_exec($ch);
curl_close($ch);

?>
You can do it in the same curl session and just change the url, but at times in the past I've had issues with it not working correctly the second time around, so I always use a new curl session for each request.
 
If you are trying to emulate clicking a button and adding something to a shopping cart, you're most likely going to have to send a GET or POST request to the server with some parameters (at the very least the item you are adding to the cart). This is what will actually set the session data for you. Check out Charles Web Debugging Proxy, and while it's recording, walk through all of the steps that you want to curl out. The app will give you the exact parameters and methods used to perform the action that you want, and you can just copy them over to curl.
 
illeat, nice tool!

nvanprooyen, illeat is right. If you need to submit a form, you need to do a POST within curl with CURLOPT_POST and CURLOPT_POSTFIELDS, or a GET by simply appending the form keys and values to the URL you're requesting.

Live HTTP Headers (a firefox plugin) is also a very popular tool for tracing headers.

Just in case you haven't found it, PHP: curl_setopt - Manual is a great resource for curl options.

And here is a basic tutorial on posting with curl... Sending POST form data with php CURL
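
For what it's worth, here is a minimal sketch of both approaches. The cart URL and the field names (`item`, `qty`) are made up; use whatever your proxy capture shows the real form sending.

```php
<?php
// GET variant: append the form keys and values to the URL.
function buildGetUrl(string $url, array $fields): string
{
    $separator = (strpos($url, '?') === false) ? '?' : '&';
    return $url . $separator . http_build_query($fields);
}

// POST variant: send the fields in the request body and keep cookies between requests.
// Returns the response body as a string, or false if the request failed.
function postForm(string $url, array $fields, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_POST, true);
    // A urlencoded string is sent as application/x-www-form-urlencoded.
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}

// Hypothetical endpoint and parameters.
$getUrl = buildGetUrl('http://www.domain.com/cart.php', ['item' => 42, 'qty' => 1]);
// $cartPage = postForm('http://www.domain.com/cart.php', ['item' => 42, 'qty' => 1], '/tmp/cookies.txt');
```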
 
I've said it before, I'll say it again: If you're not writing your own class to manage curl requests, you're wasting time.
From inside a class:

PHP:
	function init()
	{
		$this->ch=curl_init();
		// Create the cookie directory on first use.
		if(!file_exists($_SERVER['DOCUMENT_ROOT']."/cookies/"))
		{
			mkdir($_SERVER['DOCUMENT_ROOT']."/cookies/");
		}
		// Random id so each instance gets its own cookie jar.
		$this->id=rand(0,100000);
		$useragent="Not giving away my useragent, find your own";
		curl_setopt($this->ch, CURLOPT_USERAGENT, $useragent);
		// Skip certificate verification.
		curl_setopt($this->ch, CURLOPT_SSL_VERIFYPEER, false);
		// Return the body from curl_exec() instead of printing it.
		curl_setopt($this->ch, CURLOPT_RETURNTRANSFER, 1);
		// Keep the outgoing request headers so curl_getinfo() can show them.
		curl_setopt($this->ch, CURLINFO_HEADER_OUT, true);
		// Same file for reading and writing cookies.
		curl_setopt($this->ch, CURLOPT_COOKIEJAR, $_SERVER['DOCUMENT_ROOT']."/cookies/cookie-jar-".$this->id.".txt");
		curl_setopt($this->ch, CURLOPT_COOKIEFILE, $_SERVER['DOCUMENT_ROOT']."/cookies/cookie-jar-".$this->id.".txt");
		curl_setopt($this->ch, CURLOPT_FOLLOWLOCATION, 1);
		// Leave response headers out of the returned body.
		curl_setopt($this->ch, CURLOPT_HEADER, 0);
		curl_setopt($this->ch, CURLOPT_CONNECTTIMEOUT, 3);
		curl_setopt($this->ch, CURLOPT_TIMEOUT, 5);
		curl_setopt($this->ch, CURLOPT_AUTOREFERER, 0);
		curl_setopt($this->ch, CURLOPT_REFERER, "");
		$this->dataBuffer=array();
	}
	function setCookiesByID($mid)
	{
		$this->id=$mid;
		curl_setopt($this->ch, CURLOPT_COOKIEJAR, $_SERVER['DOCUMENT_ROOT']."/cookies/cookie-jar-".$mid.".txt"); 
		curl_setopt($this->ch, CURLOPT_COOKIEFILE, $_SERVER['DOCUMENT_ROOT']."/cookies/cookie-jar-".$mid.".txt"); 
	}
 
If the link you need to click has a dynamic id attached to it (not sure if this is your problem) you will need to request the page it appears on first and then parse the link and then curl that.
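
A rough sketch of the "parse the link" step, assuming the page has already been fetched into a string. The markup and the `cart.php` needle are placeholders, and `findLink` is just an illustrative name. A regex is enough for a quick scrape; a real HTML parser is sturdier on messy pages.

```php
<?php
// Return the href of the first <a> whose URL contains $needle, or null if none match.
function findLink(string $html, string $needle): ?string
{
    if (!preg_match_all('/<a\s[^>]*href=["\']([^"\']+)["\']/i', $html, $matches)) {
        return null;
    }
    foreach ($matches[1] as $href) {
        $href = html_entity_decode($href); // turn &amp; back into &
        if (strpos($href, $needle) !== false) {
            return $href;
        }
    }
    return null;
}

// Hypothetical markup standing in for the product page from the first request.
$firstpage = '<html><body><a href="/about.php">About</a>'
           . '<a class="btn" href="/cart.php?add=42&amp;sid=abc123">Add to cart</a></body></html>';

$link = findLink($firstpage, 'cart.php');
// $link is relative here, so prefix the host before using it as CURLOPT_URL in the second request.
```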

btw, what is up with the horrible syntax highlighting for PHP in the forum? It's way easier to read just wrapping it in quote tags.
 
But we thought we would let you do all the work, then jump you to get your class that you've already written.

Sounds like a plan to me.

(there are a number of classes floating around out there if you don't want to get violent with shady and want a bit of a time saver)
 
and shady emerges...
I just can't resist curl!
Sorry to thread-jack, but does anyone have a multicurl class that's well commented I could take a look at? The PHP.net documentation sucks.
I can show part of it. I don't want to show too much structure though, so I had to edit it.

curl_multi is a slow convoluted piece of shit that fails a lot more as you increase the number of connections though. I don't use this very often.

PHP:
function sendRequest()//all handles should already be configured/initialized
{
		$this->mh = curl_multi_init();//initialize multi-curl
		foreach($this->handles as $handle)
		{
			curl_multi_add_handle($this->mh,$handle);//add pre-existing handles to mh
		}
		$running=null;//not yet running
		do
		{
		    curl_multi_exec($this->mh,$running);//execute all
		    if($running > 0)
		    {
		    	curl_multi_select($this->mh, 1.0);//wait for activity instead of spinning the CPU
		    }
		} while ($running > 0);
		foreach($this->handles as $handle)//loop through handles
		{
			$tmp = curl_multi_getcontent($handle);
			//$tmp now stores your content. Do with it what you will.
			curl_multi_remove_handle($this->mh,$handle);
		}
		curl_multi_close($this->mh);
		//magic happens
}
 
What's an alternative? I've been learning Ruby on the side for this reason. With single curl calls across lists of 10,000 URLs, scripts have been taking 15+ minutes to run.
 
Run multiple instances of your non-curlmulti script.
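
One way to set that up, as a sketch only: split the URL list into slices and hand each slice to its own copy of the script. `splitUrls` and the URLs below are placeholders, not anything from a real setup.

```php
<?php
// Split a URL list into at most $workers roughly equal slices, one per script instance.
function splitUrls(array $urls, int $workers): array
{
    if ($workers < 1 || count($urls) === 0) {
        return [];
    }
    $size = (int) ceil(count($urls) / $workers);
    return array_chunk($urls, $size);
}

// Demo with placeholder URLs: 10 URLs across 3 workers.
$urls = [];
for ($i = 1; $i <= 10; $i++) {
    $urls[] = "http://www.domain.com/page$i";
}
$slices = splitUrls($urls, 3);
foreach ($slices as $n => $slice) {
    echo "worker $n gets " . count($slice) . " URLs\n";
}
```

From there you could write each slice to its own file and start one copy of your existing single-curl script per file from the shell or cron.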

Side note: You can do a lot better than CURLOPT_USERAGENT alone by using CURLOPT_HTTPHEADER to send the other headers a real browser sends, which makes your script look more like one.
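
In PHP the option that takes extra request headers is CURLOPT_HTTPHEADER, which expects an array of "Name: value" strings. A minimal sketch follows; the header values are examples only, so record what your own browser actually sends and copy that.

```php
<?php
// Example values only; capture your own browser's request and copy its headers.
function browserHeaders(): array
{
    return [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: en-us,en;q=0.5',
        'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7',
        'Connection: keep-alive',
    ];
}

// Apply the headers to an existing curl handle.
function applyBrowserHeaders($ch): void
{
    curl_setopt($ch, CURLOPT_HTTPHEADER, browserHeaders());
    // An empty string tells curl to advertise and decode every encoding it supports.
    curl_setopt($ch, CURLOPT_ENCODING, '');
}
```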