Scraping data - PDF data extraction

dhamma

New member
Dec 2, 2008
I'm currently using pdf2html to extract data and structure from thousands of PDF files.

Unfortunately it's far from perfect.

Alternatives, anyone?

thx
 


What software (and which versions) was originally used to make the PDFs? Were they made from different sources?

If they're older, you may be better off using OCR than trying to decipher embedded PS fonts.
 
dma: You can use one of the SDKs out there to fill in or get data from older versions of Photoshop, but with newer ones it's unlikely you will be able to. They now license out software that stores all the information in a database, and it is extremely expensive; the form data is encrypted so that you have to license their software. Look into the LiveCycle app to get more info on this.
 
Thanks for your answers.
The PDFs come in different versions...

@jryan21 I'll have to look into OCR. I'm already able to extract the text (via pdf2text) but am losing the document's structure with this approach. I need to recognize paragraphs etc.

@JackLinks - Very interesting... Will check it out.

@erect - they're not indexed ;-)
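
On the paragraph problem: a minimal sketch of one possible heuristic, assuming the extracted text separates paragraphs with blank lines and pages with form feeds (this will not hold for every PDF). The function name split_paragraphs is made up for illustration, and only core Perl is used:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split plain text (as produced by a PDF-to-text tool) into paragraphs.
# Assumptions: blank lines separate paragraphs, form feeds separate pages.
sub split_paragraphs {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;    # normalise line endings
    $text =~ s/\f/\n\n/g;     # treat a page break as a paragraph break
    my @paragraphs;
    for my $block (split /\n(?:[ \t]*\n)+/, $text) {
        # Rejoin words hyphenated across a line break. This also merges
        # genuine hyphenated compounds that happen to wrap at the hyphen.
        $block =~ s/(\w)-\n[ \t]*(\w)/$1$2/g;
        $block =~ s/\s+/ /g;      # collapse line wraps and runs of spaces
        $block =~ s/^ | $//g;     # trim leading/trailing space
        push @paragraphs, $block if length $block;
    }
    return @paragraphs;
}

# Usage: print each paragraph on its own line, separated by a blank line.
my $sample = "First para-\ngraph, wrapped\nover lines.\n\nSecond paragraph.\n";
print "$_\n\n" for split_paragraphs($sample);
```

If the blank-line assumption fails, for example with tight layouts or multi-column pages, the text alone is not enough and position information from the extractor would be needed.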
 
I whipped up a quick script to do all this for you.

Put all the PDFs in a folder called pdfdocs and change the permissions on the folder to 777.

Then run this Perl script in the cgi-bin with permissions 755.
Code:
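#!/usr/bin/perl
use strict;
use warnings;
use CGI::Carp qw(fatalsToBrowser);
use CAM::PDF;
use CAM::PDF::PageText;

print "Content-type: text/html\n\n";

# Folder holding the PDFs, relative to the script's working directory
my $dir = './pdfdocs';
opendir(my $dh, $dir) or die "Couldn't open directory $dir: $!";
while (my $file = readdir $dh)
{
    next unless $file =~ /\.pdf$/i;    # skip ".", ".." and non-PDF files
    print "Extracting: $file...";
    my $pdf = CAM::PDF->new("$dir/$file");
    if (!$pdf) {
        # CAM::PDF->new returns undef when the file cannot be parsed
        print "Skipped (" . ($CAM::PDF::errstr || 'unknown error') . ")<br>\n";
        next;
    }
    open(my $out, '>', "$dir/$file.txt") or die "Couldn't write $dir/$file.txt: $!";
    # Render the text of every page, not just page one
    for my $pagenum (1 .. $pdf->numPages()) {
        my $tree = $pdf->getPageContentTree($pagenum) or next;
        print {$out} CAM::PDF::PageText->render($tree), "\n";
    }
    close($out);
    print "Done<br>\n";
}
closedir $dh;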


Then reopen the pdfdocs folder; for each PDF file there will be a file with the same name ending in .txt, containing the text extracted from it.
 
@deliguy
very cool - thanks a lot
:thumbsup: +rep


I'll also have a look into TET ... looks mighty too.