What do you need for a scrape/datamine/export tool?

computergames

I already have my crawler engine (it can handle forms, most JavaScript, lots of filtering options, and so on) and a user interface, so I have begun wondering...

Is there, in general, any interest in a product that can scrape e.g. product catalogs (or whatever, really) into CSV or SQL?

The crawler would simply go through an entire website (possibly using user-defined filters to only crawl the wanted pages)... This would work well...
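As a minimal sketch of that idea only (the real engine is Delphi; the function names and the filter format here are placeholders), a breadth-first crawl of one site that keeps only pages whose URL matches a user-defined pattern:

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

def crawl_site(start_url, include_patterns):
    """Breadth-first crawl of a single site; keep pages whose URL matches a filter."""
    host = urlparse(start_url).netloc
    seen, queue, kept = {start_url}, deque([start_url]), []
    while queue:
        url = queue.popleft()
        try:
            html = urlopen(url).read().decode("utf-8", "replace")
        except OSError:
            continue                              # skip pages that fail to load
        if any(re.search(p, url) for p in include_patterns):
            kept.append((url, html))              # page passes a user-defined filter
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return kept
```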

However, the user would still need to define what exactly he/she wanted to scrape off those pages... And I am wondering if regular expressions are too complex. Then again, many people interested in such software may already know regular expressions...
 


It would support a fairly rich command line (I already have that for my other programs). Exporting to CSV etc. would also be possible through the command line.

I am primarily worried that requiring regular expressions for matching what to extract is too steep a requirement. Has anyone here seen a different system used anywhere? (I am contemplating whether it could be done visually, but I don't think it will be easy to get working properly.)

The only thing I really need to "nail" is the user interface part where users can map different regexes to different columns, then later export to a CSV file or similar.
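For illustration only (not the actual tool), here is a Python sketch of that regex-to-column mapping: each column name is paired with a regex whose first capture group becomes the cell value, and the rows go to CSV. All names and patterns below are made up.

```python
import csv
import re

def extract_to_csv(pages, column_patterns, out_path):
    """pages: iterable of (url, html); column_patterns: {column_name: regex}."""
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["url"] + list(column_patterns))
        writer.writeheader()
        for url, html in pages:
            row = {"url": url}
            for column, pattern in column_patterns.items():
                m = re.search(pattern, html, re.DOTALL)
                row[column] = m.group(1).strip() if m else ""   # first capture group = cell value
            writer.writerow(row)

# Hypothetical usage: each output column is driven by one regex with a capture group.
extract_to_csv(
    pages=[("http://example.com/p1", "<h1>Widget</h1><span class='price'>9.99</span>")],
    column_patterns={"title": r"<h1>(.*?)</h1>", "price": r"class='price'>([\d.]+)<"},
    out_path="products.csv",
)
```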
 
What's the crawler engine coded in? Is it some browser-based engine, or did you implement it from scratch?
 
Delphi. Not browser based. (For core HTTP communication it can use either the Windows WinInet API or the Indy HTTP library.) From scratch.
 
To be honest, people would rather pay a coder $100 to scrape some target than learn a complicated tool, even something like regular expressions.
 
I appreciate honest comments. It's also why I am unsure whether regular expressions are suitable for most users, and whether anyone has seen an alternative method for defining the content to scrape. But I will try to investigate the market and competing solutions. I am probably going through with it in any case since it's so closely related to my existing products.
 
If you have a place on the CLI to use the 'real' regex, that is cool because a lot of people may want to use the same expressions they used for other tools. For those less familiar, you could add a layer of abstraction to simplify some of the more commonly used search patterns.

Take a look at how Wireshark does it. They essentially break it down into 'primitive' statements:
CaptureFilters - The Wireshark Wiki
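A rough Python sketch of that kind of abstraction layer, assuming a handful of invented 'primitive' names that expand into regular expressions behind the scenes:

```python
import re

# Invented primitives for illustration; each expands to a regex fragment.
PRIMITIVES = {
    "number": r"[\d][\d.,]*",
    "email":  r"[\w.+-]+@[\w-]+\.[\w.]+",
    "link":   r'href="[^"]+"',
}

def build_pattern(primitive, before="", after=""):
    """Compose a regex from a primitive plus optional literal context."""
    return re.escape(before) + "(" + PRIMITIVES[primitive] + ")" + re.escape(after)

# e.g. "number between 'Price: $' and '</span>'" instead of a hand-written regex
pattern = build_pattern("number", before="Price: $", after="</span>")
print(re.search(pattern, "<span>Price: $1,299.00</span>").group(1))   # 1,299.00
```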
 
Thanks everyone for your comments :)

I will probably start out by supporting regex, then possibly add a visual abstraction of it somehow. I can see some products feature this (not sure how well it works).
 
How are you going to scrape specific information and save it into different named fields? I'd just straight up have it execute code that navigates the DOM. I mean, honestly, if you know PHP it's pretty trivial to write a pretty sophisticated scraper in half a page or so with a DOM library. With html-simple-dom (I think that's the name) it's almost as easy as writing jQuery code.

If you're using an embedded browser, there may actually be a way to load jQuery and have the language be really simple JavaScript.
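As a rough analogue (in Python rather than PHP, and not the library mentioned above), this is what jQuery-style selector scraping looks like with the third-party BeautifulSoup package:

```python
from bs4 import BeautifulSoup

html = """
<div class="product">
  <h2 class="title">Widget</h2>
  <span class="price">9.99</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
for product in soup.select("div.product"):            # jQuery-style CSS selector
    title = product.select_one("h2.title").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    print(title, price)                                # Widget 9.99
```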
 
I use my own engine. I use an HTTP engine (optionally the WinInet Windows API); the rest I do myself. The crawler engine is 5 years old now (continually developed as needed) and used in a variety of programs.

I will agree that DOM navigation would be an alternative to regex matching. Still, if I really need to, I will generate my own tree from the ground up.
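Purely for illustration, a very small Python sketch of building such a tree on top of the standard library's html.parser; the node structure and class names here are invented:

```python
from html.parser import HTMLParser

class Node:
    def __init__(self, tag, attrs=None):
        self.tag, self.attrs, self.children, self.text = tag, dict(attrs or []), [], ""

class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)   # attach to current parent
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()
    def handle_data(self, data):
        self.stack[-1].text += data

builder = TreeBuilder()
builder.feed("<div><p class='x'>hello</p></div>")
print(builder.root.children[0].children[0].text)   # hello
```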
 
The regex will be powerful enough to extract data (I've used my own scraping with regex plus some coding for tons of different purposes over time), but it may be too complicated for many to use :(

It's an issue I will have to address later. In version one I will simply try to assist the user in writing the regex and matching it to data columns through various methods. But if you don't know regular expressions at all, the tool will not be useful in version 1.
 
Regarding the COM question: if you are asking because you want to control the program from outside, all I offer is pretty comprehensive command line support where you can initiate jobs/projects. But IMO that will suffice for most uses. If you have multiple projects you want automated, just launch more instances :)
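Purely as a sketch of what that automation could look like, a small Python driver launching one instance per project; the executable name and the --project flag are placeholders, not the tool's actual interface:

```python
import subprocess

projects = ["catalog_a.project", "catalog_b.project"]
procs = [
    subprocess.Popen(["scrapetool.exe", "--project", p])   # hypothetical invocation
    for p in projects
]
for proc in procs:
    proc.wait()   # wait for all instances to finish
```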

I have pondered exposing the whole crawler engine as e.g. COM, but it would be a major undertaking, and I'd rather spend that time on preparing for e.g. native Mac ports.