Craigslist Email Scraper with Python

Status
Not open for further replies.

kblessinggr

PedoBeard
Sep 15, 2008
G.R., Michigan
www.kbeezie.com
Code:
import re
import sys
import urllib2
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

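def searchUrl(cururl, sfile, category, begin=False):
    # Returns the number of posting links found on an index page,
    # or 0 for a posting page or a page that could not be read.
    # (begin is accepted for compatibility but is not used.)
    try:
        response = urllib2.urlopen(cururl)
        html = response.read()
        response.close()
        soup = BeautifulSoup(html)
    except (ValueError, IOError):
        # urllib2.URLError is a subclass of IOError
        print "Could not parse %s, skipping" % (cururl,)
        return 0

    emails = soup.findAll('a', href=re.compile('^mailto:'))

    if emails:
        # Posting page: write every address, one per line
        for email in emails:
            # Drop the 'mailto:' scheme and any '?subject=...' query string
            address = email['href'].split(':', 1)[1].split('?', 1)[0]
            sfile.write(address + '\n')
        return 0

    # Index page: follow every link that points into the category
    links = soup.findAll('a', href=re.compile('^(/(.+/)?%s/.+)' % (category,)))
    count = len(links)
    if "index" in cururl:
        print "Parsing %s (%d postings)" % (cururl, count)

    for link in links:
        searchUrl(urljoin(cururl, link['href']), sfile, category)

    return count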

#Initialization
parameterflag = False
arguments = len(sys.argv)
page = 0
begin = True

if arguments > 1:
    starturl = sys.argv[1]
    # Page URLs are built by appending "index.html", so a trailing slash is required
    if not starturl.endswith('/'):
        starturl += '/'
    # 'http://city.craigslist.org/sys/' splits into ['http:', '', 'city.craigslist.org', 'sys', '']
    folder = starturl.split('/')
    if len(folder) > 3:
        rcategory = folder[3]
    else:
        rcategory = ''
    if len(rcategory) == 0:
        parameterflag = True
    elif arguments > 2:
        sfile = sys.argv[2]
    else:
        sfile = "emails"
else:
    parameterflag = True

print "ScrapCL 0.1.2 by Karl Blessing (www.karlblessing.com)\n--------------------------------------------------\n"
if parameterflag:
    print "Usage: scrapcl start_url [save_file_name]\n"
    print "start_url must be a craigslist url with a preceding category,\nresults are restricted to this category\n"
    print "\thttp://city.craigslist.org/category/\n"
    print "save_file_name is an optional parameter, if no name is provided\nthe file name defaults to emails.\n"
    print "\tscrapcl http://city.craigslist.org/sys/ syse\n\twill save a file called syse.txt with the emails (one per line)"

else:
    print "Starting from: %s\nSaving emails to %s.txt (this may take several minutes)" % (starturl, sfile)
    outputfile = open(sfile+".txt", "a")

    # Index pages advance in steps of 100 postings: index.html, index100.html, index200.html, ...
    while page < 50000:
        if page > 0:
            url = "%sindex%d.html" % (starturl, page)
        else:
            url = "%sindex.html" % (starturl,)
            begin = False

        count = searchUrl(url, outputfile, rcategory, begin)

        # Fewer than 100 postings means this was the last index page
        if count < 100:
            break
        else:
            page += 100

    print "Finished Collecting emails into %s.txt" % (sfile,)
    outputfile.close()

Requires Python 2.4 or above, and BeautifulSoup 3.0.* (do not try to use BeautifulSoup 3.1 with Python 2.6 or lower.)
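
For anyone without Python 2 and BeautifulSoup 3 at hand, the address-extraction step on its own can be sketched with nothing but the Python 3 standard library. This is only an illustration of the idea, not a port of the script: the names MailtoCollector and extract_emails and the sample HTML are made up for the example, and it does no fetching or crawling.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class MailtoCollector(HTMLParser):
    """Collect the address part of every <a href="mailto:..."> link."""

    def __init__(self):
        super().__init__()
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if href.lower().startswith("mailto:"):
            # Drop the scheme, then any "?subject=..." query string
            address = href.split(":", 1)[1].split("?", 1)[0]
            self.addresses.append(unquote(address))


def extract_emails(html):
    # Hypothetical helper: feed one page of HTML, return the addresses found
    collector = MailtoCollector()
    collector.feed(html)
    return collector.addresses


# Made-up sample markup, not real Craigslist output
sample = (
    '<a href="mailto:sale-123@example.org?subject=Desk">reply</a>'
    '<a href="/sys/456.html">another posting</a>'
    '<a href="mailto:job-789@example.org">reply</a>'
)
print(extract_emails(sample))
```

Splitting on the first ':' and the first '?' means a link without a subject line is handled the same way as one with it.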

Usage

emailfile is optional; if not provided, it defaults to emails.txt, and if you type sys instead, it saves to sys.txt. The file is opened in append mode, so you can run the command again on another city and it will append those emails to the list if the file already exists.
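
A small Python 3 sketch of what append mode means across two runs. The helper names save_emails and read_unique and the example addresses are hypothetical, and the de-duplication step is an extra assumption on my part: the script itself does not remove duplicates.

```python
import os
import tempfile


def save_emails(path, emails):
    # Mode "a" creates the file if it is missing and otherwise writes at the end
    with open(path, "a") as handle:
        for email in emails:
            handle.write(email + "\n")


def read_unique(path):
    # Read the list back, dropping blank lines and repeated addresses
    # while keeping the order in which they were first seen
    with open(path) as handle:
        lines = [line.strip() for line in handle if line.strip()]
    return list(dict.fromkeys(lines))


path = os.path.join(tempfile.mkdtemp(), "emails.txt")
save_emails(path, ["a@example.org", "b@example.org"])  # first city
save_emails(path, ["b@example.org", "c@example.org"])  # second city
print(read_unique(path))
```

Because nothing is overwritten, an address that shows up in both runs ends up in the file twice, which is why a clean-up pass like read_unique can be handy afterwards.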
 


Ooh, Python... I'm more into PHP and thought Python was dying. I never looked into it.

This is kinda off-topic, but would you say it is worth it to put time into learning it? Just curious about your experience with it.

And thanks for the script, looks good even though I wouldn't know what to do with it right now :)
 
Not sure, I'm still learning it; I only started yesterday. Of course, saying Python is dying is like saying C/C++ is dead.
 
Lol that might be. Like I said, never looked into it.

You started learning it yesterday and have a script ready today? Respect.

Once you know a few programming languages it's not extremely difficult to pick up the basics of another. Prior to PHP I knew C++, VB, ASP, and Pascal.

Also, some of the examples for BeautifulSoup (a handy HTML parser) and such came in useful. It was mainly a matter of figuring out the syntax and 'behavior' of Python. Python, by comparison to PHP, is an object-oriented language where EVERYTHING, I shit you not, is considered an object.
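
A quick Python 3 illustration of the "everything is an object" point. The greet function is just a made-up example; the rest is built-in behaviour.

```python
def greet(name):
    return "Hello %s" % name


# Plain integers are objects with methods
print(isinstance(42, object))
print((255).bit_length())

# Functions are objects too: they have attributes and a type
print(greet.__name__)
print(type(greet).__name__)

# ...and can be stored in a list and passed around like any other value
callbacks = [greet, str.upper]
print([f("karl") for f in callbacks])
```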

But as far as using Python for the web. Not quite used to that yet. Take for example a simple print out functionality for the web.

PHP (via index.php?name=Karl)
Code:
<?="Hello ".$_GET['name']?>

Python (via mod_wsgi to a url like /?name=Karl)
Code:
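from cgi import parse_qs, escape

def application(environ, start_response):
    # Parse the query string into a dict of lists, e.g. {'name': ['Karl']}
    parameters = parse_qs(environ.get('QUERY_STRING', ''))
    # Use an empty name if the parameter is missing, and escape it for HTML
    name = escape(parameters.get('name', [''])[0])
    s = "Hello %s" % name
    start_response("200 OK", [('Content-Type', 'text/html')])
    return [s]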

(Python is whitespace-sensitive; it's the indentation that puts those lines inside application().)
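
For comparison, a rough Python 3 sketch of the same WSGI idea. In Python 3, parse_qs lives in urllib.parse and escape in html, and the response body has to be bytes. The call_app helper is hypothetical and only exists so the app can be exercised without mod_wsgi or any web server.

```python
from html import escape
from urllib.parse import parse_qs


def application(environ, start_response):
    # parse_qs returns a dict of lists, e.g. {'name': ['Karl']}
    parameters = parse_qs(environ.get("QUERY_STRING", ""))
    # Missing parameter -> empty name; escape guards against HTML injection
    name = escape(parameters.get("name", [""])[0])
    body = ("Hello %s" % name).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8"),
                              ("Content-Length", str(len(body)))])
    return [body]


def call_app(query_string):
    """Hypothetical helper: call the app directly with a minimal environ."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status
        captured["headers"] = headers

    chunks = application({"QUERY_STRING": query_string}, start_response)
    return captured["status"], b"".join(chunks).decode("utf-8")


print(call_app("name=Karl"))
print(call_app("name=<b>"))
```

A WSGI app is just a callable that takes the request environment and a start_response callback, so it can be called like any other function.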
 
Python is dying? I knew WF's "programmers" were shitty but that takes the cake.

I tend to agree - more than anything, it's short-sighted.

Python isn't a dying language - it's still widely used, and Django is a very popular framework for Python. There are a number of ports of it, including one called Jython, which compiles to Java bytecode so it can run on Java Virtual Machines. Microsoft is also in the process of creating a framework/system that allows you to make AJAX requests without JavaScript; it's being written in a Rails version and a Python version. Also, this is taken from a Wikipedia article...

Among the users of Python are YouTube[22] and the original BitTorrent client.[23] Large organizations that make use of Python include Google,[24] Yahoo!,[25] CERN,[26] NASA,[27] and ITA.[28] Most of the Sugar software for the One Laptop Per Child XO, now developed at Sugar Labs, is written in Python.

If you notice, all of Google's Knowledge Base articles are written in Python. And also, as with many frameworks in other languages like PHP, ASP, etc. - just because the extension doesn't end in .py doesn't mean it isn't Python.

They also mention its inclusion in desktop programs like Blender, Maya, Gimp & Paint Shop Pro.
 
kblessinggr: my comment wasn't about your code. I made that comment because 99% of the so-called programmers here are stuck in their PHP world. That's fine and all, but unless someone steps out of their little bubble and actually produces work in other languages, they probably shouldn't be telling people "X is dead".

It was more of a one-line rant than anything else :)
 