What do you need for a scrape/datamine/export tool?

computergames

I already have my crawler engine (it can handle forms, most JavaScript, lots of filtering options, and so on) and a user interface, so I have begun wondering...

Is there, in general, any interest in a product that can scrape e.g. product catalogs (or whatever, really) into CSV or SQL?

The crawler would simply go through an entire website (possibly using user-defined filters to only crawl the wanted pages)... This would work well...
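As a minimal sketch of that idea only (the real engine is Delphi; the function names and the filter format here are placeholders), a breadth-first crawl of one site that keeps only pages whose URL matches a user-defined pattern:

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def crawl_site(start_url, include_patterns):
    """Breadth-first crawl of a single site; keep pages whose URL matches a filter."""
    host = urlparse(start_url).netloc
    seen, queue, kept = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue                              # skip pages that fail to load
        if any(re.search(p, url) for p in include_patterns):
            kept.append((url, html))              # page passes a user-defined filter
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return kept
```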

However, the user would still need to define what exactly he/she wanted to scrape off those pages... And I am wondering if regular expressions are too complex. Then again, many people interested in such software may already know regular expressions...
 


It would support a fairly rich command line (I already have that for my other programs). Exporting to CSV etc. would also be possible through the command line.

I am primarily worried that requiring regular expressions for matching what to extract is too steep a requirement. Has anyone here seen a different system used anywhere? (I am contemplating whether it could be done visually, but I don't think it will be easy to get working properly.)

The only thing I really need to "nail" is the user interface part where users can map different regexes to different columns, then later export to a CSV file or similar.
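For illustration only (not the actual tool), here is a Python sketch of that regex-to-column mapping: each column name is paired with a regex whose first capture group becomes the cell value, and the rows go to CSV. All names and patterns below are made up.

```python
import csv
import re

def extract_to_csv(pages, column_patterns, out_path):
    """pages: iterable of (url, html); column_patterns: {column_name: regex}."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url"] + list(column_patterns))
        writer.writeheader()
        for url, html in pages:
            row = {"url": url}
            for column, pattern in column_patterns.items():
                m = re.search(pattern, html, re.DOTALL)
                row[column] = m.group(1).strip() if m else ""   # first capture group = cell value
            writer.writerow(row)

# Hypothetical usage: each output column is driven by one regex with a capture group.
extract_to_csv(
    pages=[("http://example.com/p1", "<h1>Widget</h1><span class='price'>9.99</span>")],
    column_patterns={"title": r"<h1>(.*?)</h1>", "price": r"class='price'>([\d.]+)<"},
    out_path="products.csv",
)
```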
 
What's the crawler engine coded in? Is it some browser-based engine, or did you implement it from scratch?
 
Delphi. Not browser based. (For core HTTP communication it can use either the Windows WinInet API or the Indy HTTP library.) From scratch.
 
To be honest, people would rather pay a coder $100 to scrape some target than learn a complicated tool, even something like regular expressions.
 
I appreciate honest comments. It's also why I am unsure whether regular expressions are suitable for most users, and whether anyone has seen an alternative method for defining the content to scrape. But I will try to investigate the market and competing solutions. I am probably going through with it in any case since it's so closely related to my existing products.
 
If you have a place on the CLI to use the 'real' regex, that is cool because a lot of people may want to use the same expressions they used for other tools. For those less familiar, you could add a layer of abstraction to simplify some of the more commonly used search patterns.

Take a look at how Wireshark does it. They essentially break it down into 'primitive' statements:
CaptureFilters - The Wireshark Wiki
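A rough Python sketch of that kind of abstraction layer, assuming a handful of invented 'primitive' names that expand into regular expressions behind the scenes:

```python
import re

# Invented primitives for illustration; each expands to a regex fragment.
PRIMITIVES = {
    "number": r"[\d][\d.,]*",
    "email":  r"[\w.+-]+@[\w-]+\.[\w.]+",
    "link":   r'href="[^"]+"',
}

def build_pattern(primitive, before="", after=""):
    """Compose a regex from a primitive plus optional literal context."""
    return re.escape(before) + "(" + PRIMITIVES[primitive] + ")" + re.escape(after)

# e.g. "number between 'Price: $' and '</span>'" instead of a hand-written regex
pattern = build_pattern("number", before="Price: $", after="</span>")
print(re.search(pattern, "<span>Price: $1,299.00</span>").group(1))   # 1,299.00
```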
 
Thanks everyone for your comments :)

I will probably start out by supporting regex, then possibly add a visual abstraction of it somehow. I can see some products feature this (not sure how well it works).
 
How are you going to scrape specific information and save it into different named fields? I'd just straight up have it execute code that navigates the DOM. I mean, honestly, if you know PHP it's pretty trivial to write a pretty sophisticated scraper in half a page or so with a DOM library. With html-simple-dom (I think that's the name) it's almost as easy as writing jQuery code.

If you're using an embedded browser, there may actually be a way to load jQuery and have the language be really simple JavaScript.
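As a rough analogue (in Python rather than PHP, and not the library mentioned above), this is what jQuery-style selector scraping looks like with the third-party BeautifulSoup package:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Widget</h2>
  <span class="price">9.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):            # jQuery-style CSS selector
    title = product.select_one("h2.title").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    print(title, price)                                # Widget 9.99
```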
 
I use my own engine. I use an HTTP engine (optionally the WinInet Windows API); the rest I do myself. The crawler engine is 5 years old now (continually developed as needed) and used in a variety of programs.

I will agree that DOM navigation would be an alternative to regex matching. Still, if I really need to, I will generate my own tree from the ground up.
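Purely for illustration, a very small Python sketch of building such a tree on top of the standard library's html.parser; the node structure and class names here are invented:

```python
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, attrs=None):
        self.tag, self.attrs, self.children, self.text = tag, dict(attrs or []), [], ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)   # attach to current parent
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def handle_data(self, data):
        self.stack[-1].text += data

builder = TreeBuilder()
builder.feed("<div><p class='x'>hello</p></div>")
print(builder.root.children[0].children[0].text)   # hello
```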
 
The regex will be powerful enough to extract data (I've used my own scraping with regex plus some coding for tons of different purposes over time), but it may be too complicated for many to use :(

It's an issue I will have to address later. In version one I will simply try to assist the user in writing the regex and matching it to data columns through various methods. But if you don't know regular expressions at all, the tool will not be useful in version 1.
 
Regarding the COM question: if you are asking because you want to control the program from outside, all I offer is pretty comprehensive command line support where you can initiate jobs/projects. But IMO that will suffice for most uses. If you have multiple projects you want automated, just launch more instances :)
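Purely as a sketch of what that automation could look like, a small Python driver launching one instance per project; the executable name and the --project flag are placeholders, not the tool's actual interface:

```python
import subprocess

projects = ["catalog_a.project", "catalog_b.project"]
procs = [
    subprocess.Popen(["scrapetool.exe", "--project", p])   # hypothetical invocation
    for p in projects
]
for proc in procs:
    proc.wait()   # wait for all instances to finish
```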

I have pondered exposing the whole crawler engine as e.g. COM, but it would be a major undertaking, and I'd rather spend that time on preparing for e.g. native Mac ports.