I wrote a little script a few days ago to grab SERP results from Yahoo!. I had another script that just queried their regular search page, but I kept getting temporarily IP banned before long, so I knew I had to get an API key. I figured I would post it here so it would get more views, and hopefully some folks will find it useful.
There are obviously better ways to do this, such as using one of the XML processing classes. This way was the quickest for me to do with my (limited) scripting skill set. I am going to rewrite this when I get the chance to make it more structurally sound. For instance, you can't return 101 results: anything above 100 gets rounded to the nearest multiple of 100. Also, for your purposes, you might want to output only a list of the URLs (as I normally want) or maybe write them to a DB. For this, just comment out the echo lines that are not needed.
I want to add these options to the console. I could also clean up the code and parse the XML in a more sophisticated way, with one of the pre-existing classes.
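For anyone who wants to try the class-based route mentioned above, here is a minimal sketch using SimpleXML (PHP 5 only), not the method the script itself uses. The sample XML is hand-written for illustration: the attribute and element names are the ones the script's regexes look for, but the ResultSet/Result wrapper names and the helper function parse_results() are assumptions, so check them against a real response first.

```php
<?php
// Hand-written sample standing in for a real response (assumed shape).
$xml = '<?xml version="1.0" encoding="UTF-8"?>'
     . '<ResultSet totalResultsAvailable="2500" totalResultsReturned="2" firstResultPosition="1">'
     . '<Result><Title>First page</Title><Summary>About the first page</Summary>'
     . '<DisplayUrl>www.example.com/one</DisplayUrl></Result>'
     . '<Result><Title>Second page</Title><Summary>About the second page</Summary>'
     . '<DisplayUrl>www.example.com/two</DisplayUrl></Result>'
     . '</ResultSet>';

// Hypothetical helper: turns the XML into a plain array, or null if it cannot be parsed.
function parse_results($xml) {
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return null;
    }
    $rows = array();
    foreach ($doc->Result as $result) {
        // Cast each node to string to get its text content
        $rows[] = array(
            'title'   => (string) $result->Title,
            'summary' => (string) $result->Summary,
            'url'     => (string) $result->DisplayUrl,
        );
    }
    return array(
        // Attributes of the root element are read with array syntax
        'total' => (int) $doc['totalResultsAvailable'],
        'rows'  => $rows,
    );
}

$parsed = parse_results($xml);
echo $parsed['total'], " results available\n";
foreach ($parsed['rows'] as $row) {
    echo $row['title'], ' - ', $row['url'], "\n";
}
```

If you only want the URLs, keep just the 'url' entry and drop the rest.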
This script makes use of the curl extension for PHP (I am on 5, but it should work fine with 4). So if you are using a local install of PHP, you will need to go into your php.ini file (the one your install actually loads) and uncomment the extension line for curl (for example, ;extension=php_curl.dll on Windows). If you are using a webhost, they may or may not allow curl.
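If you are not sure whether curl is enabled or which php.ini is being read, a quick check like the one below may help. It is just a sketch: extension_loaded() works on PHP 4 and 5, but php_ini_loaded_file() needs PHP 5.2.4 or later, so drop that line on older installs. The variable names are mine.

```php
<?php
// true when the curl extension is loaded in this PHP build
$has_curl = extension_loaded('curl');
// path of the php.ini actually in use, or false if none was loaded (PHP 5.2.4+)
$ini_path = php_ini_loaded_file();

echo 'curl extension: ', ($has_curl ? 'loaded' : 'NOT loaded'), "\n";
echo 'php.ini in use: ', ($ini_path ? $ini_path : '(none)'), "\n";
```

Run it through the same PHP that serves the script, since the command line and the web server can load different php.ini files.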
One more quick thing: you will need to sign up for a Yahoo! API key, which is only good for 5K calls per day. Enjoy!
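Since the key is limited per day, it helps to know how many calls one query uses. Below is a rough sketch, assuming each request counts as one call and using the limits the script works with (up to 100 results per request, 1000 per query). The function plan_pages() is a hypothetical helper, not part of the script; as a side effect it also handles odd counts like 101.

```php
<?php
// Hypothetical helper: split a requested result count into (start, results) pairs.
// Assumes at most 100 results per call and 1000 results per query.
function plan_pages($num, $per_call = 100, $max_total = 1000) {
    $num = min((int) $num, $max_total);
    $pages = array();
    for ($start = 1; $start <= $num; $start += $per_call) {
        $pages[] = array(
            'start'   => $start,
            // The last page only asks for what is left
            'results' => min($per_call, $num - $start + 1),
        );
    }
    return $pages;
}

$pages = plan_pages(250);
foreach ($pages as $page) {
    echo 'start=', $page['start'], ' results=', $page['results'], "\n";
}
// One call per page, so this many such queries fit in a 5000-call day
echo floor(5000 / count($pages)), " queries per day\n";
```

For 250 results this plans three requests (100, 100 and 50 results), so a full 1000-result query would use ten calls.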
Edit: wow, I forgot how completely unreadable the PHP code is here; the background should really be changed. Here is the script, attached as a txt file.
PHP:
<?php
// Request Yahoo! REST Web Service using
// curl and PHP 4 / 5
// Returns a set number of results
// Author: lucab
// August 1, 2007
// I got the HTTP status error handling part from a script on http://developer.yahoo.com
########################################################################################
##### ATTENTION ########################################################################
##### THIS IS THE ONLY THING YOU NEED TO DO TO GET THIS SCRIPT WORKING!!!!!!!! #########
########################################################################################
$api = ''; //enter your Yahoo! API key here or this script will not work!!!!!!
//some general housekeeping
error_reporting(E_ALL);
set_time_limit(0);
?>
<h1>Yahoo! SERPs Parser with Yahoo! API</h1>
<form action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>" method="post">
Enter your query here: <input type="text" name="query" /><br />
Enter # of results: <input type="text" name="results" /><br />
<input type="submit" />
</form>
<?php
if ($_SERVER['REQUEST_METHOD'] == 'POST'){
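//read what we are searching for and how many results we want, with safe defaults if a field is missing
$query = isset($_POST['query']) ? trim($_POST['query']) : '';
$num = isset($_POST['results']) ? (int) $_POST['results'] : 0;
//stop early on empty input instead of hitting undefined variables further down
if ($query == '' || $num < 1){
die("<h3>Please enter a query and a number of results greater than zero.</h3>");
}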
//here we figure out how we are going to loop through all the results
if ($num <= 100){
$results = $num;
$i = 1;
}elseif ($num > 100 && $num <=1000){
$results = 100;
$i = round($num / 100);
}else{
die("<h3>You cannot get more than 1000 results at a time!</h3>");
}
//here is our loop to go through 100 etc...
for ($j = 0; $j < $i; $j++){
$start = $j * 100 + 1;
if ($start > 1000){
die;
}
// The Yahoo! Web Services request
$request = 'http://api.search.yahoo.com/WebSearchService/V1/webSearch?appid='.urlencode($api).'&query='.urlencode($query).'&start='.urlencode($start).'&results='.urlencode($results);
// Initialize the session
$session = curl_init($request);
// Set curl options
curl_setopt($session, CURLOPT_HEADER, true);
curl_setopt($session, CURLOPT_RETURNTRANSFER, true);
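// Make the request
$response = curl_exec($session);
// Ask curl for the HTTP status code instead of guessing it from the raw header text
$http_code = curl_getinfo($session, CURLINFO_HTTP_CODE);
// Close the curl session
curl_close($session);
// curl_exec returns false when the request itself failed (DNS, timeout, etc.)
if ($response === false){
die('The request to Yahoo Web Services failed before any response was received.');
}
// Keep the status code in an array so the switch below can read $status_code[0]
$status_code = array($http_code);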
// Check the HTTP Status code
switch( $status_code[0] ) {
case 200:
// Success
break;
case 503:
die('Your call to Yahoo Web Services failed and returned an HTTP status of 503. That means: Service unavailable. An internal problem prevented us from returning data to you.');
break;
case 403:
die('Your call to Yahoo Web Services failed and returned an HTTP status of 403. That means: Forbidden. You do not have permission to access this resource, or are over your rate limit.');
break;
case 400:
// You may want to fall through here and read the specific XML error
die('Your call to Yahoo Web Services failed and returned an HTTP status of 400. That means: Bad request. The parameters passed to the service did not match as expected. The exact error is returned in the XML response.');
break;
default:
die('Your call to Yahoo Web Services returned an unexpected HTTP status of:' . $status_code[0]);
}
// Get the XML from the response, bypassing the header
if (!($xml = strstr($response, '<?xml'))) {
$xml = null;
}
//here is some regex to parse the returned xml for general data
preg_match_all('@totalResultsAvailable="([0-9]+)" totalResultsReturned="([0-9]+)" firstResultPosition="([0-9]+)"@', $xml, $matches);
$total = $matches[1];
$returned = $matches[2];
$position = $matches[3];
//here is some regex to parse the returned xml for specific data
preg_match_all('@<Title>(.+?)</Title>@i', $xml, $title);
preg_match_all('@<DisplayUrl>(.+?)</DisplayUrl>@i', $xml, $url);
preg_match_all('@<Summary>(.+?)</Summary>@i', $xml, $summary);
//now let's process the raw arrays that the preg suite returns with the match function
$title = $title[1];
$url = $url[1];
$summary = $summary[1];
//print out some general info on this query
if (!($j > 0)){
//escape the user's input before echoing it back into the page
echo "<h3>We are scraping Yahoo! SERPs for the term [".htmlspecialchars($query)."]</h3>";
echo "<h2>There were $total[0] results. Returning ".(int) $num." results.</h2>";
}
//now let's loop through each result and display the title with the correlating URL
for ($k = 0; $k < $results; $k++){
$count = $j * 100 + $k + 1;
echo "$count<br />";
if (isset($title[$k])){
echo "Title: $title[$k]<br />";
}else{
echo "<strong>ERROR</strong> - Could not scrape title for this result<br />";
}
if (isset($summary[$k])){
echo "Summary: ".$summary[$k]."<br />";
}else{
echo "<strong>ERROR</strong> - Could not scrape summary for this result<br />";
}
if (isset($url[$k])){
echo "$url[$k]<br /><br />";
}else{
echo "<strong>ERROR</strong> - Could not scrape URL for this result<br /><br />";
}
}
}
}
?>