Code:
import re
import sys
import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup
def searchUrl(cururl, sfile, category, begin = False):
try:
response = urllib2.urlopen(cururl)
html = response.read()
soup = BeautifulSoup(html)
emails = soup.findAll('a', href=re.compile('^mailto:'))
if len(emails) > 0:
for email in emails:
email = email['href'][email['href'].index(':')+1:email['href'].index('?')]
sfile.write(email+'\n')
count = 0
else:
links = soup.findAll('a', href=re.compile('^(/(.+/)?%s/.+)' % (category,)))
count = len(links)
if "index" in cururl:
print "Parsing %s (%d postings)" % (cururl, count)
for link in links:
searchUrl(urljoin(cururl, link['href']), sfile, category)
response.close()
return count
except ValueError:
print "Could not parse %s, skipping" % (cururl,)
#Initialization
parameterflag = False
arguments = len(sys.argv)
page = 0
begin = True
if arguments > 1:
starturl = sys.argv[1]
folder = starturl.split('/')
rcategory = folder[3]
if len(rcategory) == 0:
parameterflag = True
else:
if arguments > 2:
sfile = sys.argv[2]
else:
sfile = "emails"
else:
parameterflag = True
print "ScrapCL 0.1.2 by Karl Blessing (www.karlblessing.com)\n--------------------------------------------------\n"
if parameterflag:
print "Usage: scrapcl start_url [save_file_name]\n"
print "start_url must be a craigslist url with a preceeding category,\nresults are restricted to this category\n"
print "\thttp://city.craigslist.org/category/\n"
print "save_file_name is an optional parameter, if no name is provided\n emails will be defaulted.\n"
print "\tscrapcl http://city.craigslist.org/sys/ syse\n\twill save a file called syse.txt with the emails (one per line)"
else:
print "Starting from: %s\nSaving emails to %s.txt (this may take several minutes)" % (starturl, sfile)
outputfile = open(sfile+".txt", "a")
while page < 50000:
if page > 0:
url = "%sindex%d.html" % (starturl, page)
else:
url = "%sindex.html" % (starturl,)
begin = False
count = searchUrl(url, outputfile, rcategory, begin)
if count < 100:
break;
else:
page += 100
print "Finished Collecting emails into %s.txt" % (sfile,)
outputfile.close()
Requires Python 2.4 or above, and BeautifulSoup 3.0.x (do not try to use BeautifulSoup 3.1 with Python 2.6 or lower).
Usage
python scrapcl.py http://city.craigslist.org/category/ emailfile
emailfile is optional; if not provided, it defaults to emails.txt, and if you type sys instead, it saves sys.txt. The write mode is set to append, so you can run the command again on another city, and it will append those emails onto the list if the file already exists.