Scraping data - PDF data extraction

dhamma · Jan 19, 2010

I'm currently using pdf2html to extract data and structure from thousands of PDF files.

Unfortunately, it's far from perfect.

Any alternatives?

Thanks

jryan21 · Jan 19, 2010

What software (and which versions) was originally used to create the PDFs? Were they made from different sources?

If they're older, you may be better off using OCR rather than trying to decipher embedded PostScript fonts.

JackLinks · Jan 19, 2010

dhamma: You can use one of the SDKs out there to fill in or read form data from PDFs made with older versions of Acrobat, but with newer ones it's unlikely you will be able to. They now license out software that stores all the information in a database, and it is extremely expensive, so they encrypt the form data so that you have to license their software. Look into the LiveCycle app for more info on this.

erect · Jan 19, 2010

Yahoo/Google have an HTML cache of the files if they're indexed.

guerilla · Jan 19, 2010

erect said:
Yahoo/Google have an HTML cache of the files if they're indexed.

You're awesome. That answer solves a problem I was having with ubot.

dhamma · Jan 20, 2010

Thanks for your answers.
The PDFs come in different versions...

@jryan21 I'll have to look into OCR. I'm already able to extract the text (via pdf2text), but I'm losing the document's structure with this approach. I need to recognize paragraphs etc.

@JackLinks - Very interesting .. Will check it out.

@erect - they're not indexed ;-)
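
On the point about losing paragraphs with plain text extraction: a minimal sketch in Perl, assuming the extracted text still has blank lines between paragraphs (that is not guaranteed for every PDF or every extractor). The helper name split_paragraphs and the sample text are made up for illustration and are not part of any tool mentioned in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split extracted text into paragraphs.
# Assumption: paragraphs are separated by one or more blank lines.
sub split_paragraphs {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # normalise Windows/Mac line endings
    my @paragraphs;
    for my $block (split /\n(?:[ \t]*\n)+/, $text) {
        $block =~ s/\s+/ /g;      # join wrapped lines, collapse whitespace
        $block =~ s/^ | $//g;     # trim leading/trailing space
        push @paragraphs, $block if length $block;
    }
    return @paragraphs;
}

# Made-up sample standing in for the text extracted from one PDF.
my $sample = "First paragraph,\nwrapped over two lines.\n\n\nSecond paragraph.\n";
my @paras = split_paragraphs($sample);
print scalar(@paras), " paragraphs\n";
print "[$_]\n" for @paras;
```

If the extractor does not keep blank lines, this will return everything as one paragraph, and a layout-aware tool or OCR would be needed instead.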

dhamma · Jan 20, 2010

here come the ladies

showkiller · Jan 20, 2010

Google "PDFlib TET".

Deliguy · Jan 20, 2010

I whipped up a quick script to do all this for you.

Put all the PDFs in a folder called pdfdocs and change the permissions on the folder to 777.

Then run this Perl script in the cgi-bin with permissions 755.

Code:

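#!/usr/bin/perl
use strict;
use warnings;
use CGI::Carp qw(fatalsToBrowser);
use CAM::PDF;
use CAM::PDF::PageText;

print "Content-type: text/html\n\n";

# Folder holding the PDFs, relative to the script's working directory.
my $dir = './pdfdocs';
opendir(my $dh, $dir) or die "Couldn't open directory $dir: $!";
while (my $file = readdir $dh)
{
    # Skip '.', '..' and anything that is not a PDF.
    next unless $file =~ /\.pdf$/i;
    print "Extracting: $file...";
    my $pdf = CAM::PDF->new("$dir/$file");
    if (!$pdf)
    {
        print "Skipped (could not parse: $CAM::PDF::errstr)<br>\n";
        next;
    }

    # Collect the text of every page, not just the first one.
    my $text = '';
    for my $page (1 .. $pdf->numPages())
    {
        my $tree = $pdf->getPageContentTree($page) or next;
        $text .= CAM::PDF::PageText->render($tree) . "\n";
    }

    open(my $out, '>', "$dir/$file.txt") or die "Couldn't write $dir/$file.txt: $!";
    print {$out} $text;
    close($out);
    print "Done<br>\n";
}
closedir $dh;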

Then reopen the pdfdocs folder. For each PDF file there will be a file with the same name ending in .txt, containing all the text extracted from it.

dhamma · Jan 21, 2010

@Deliguy
Very cool - thanks a lot
:thumbsup: +rep

I'll also have a look into TET... it looks mighty too.

dhamma · Jan 21, 2010

damn the ladies left.. one more try
