SCRIPT/TOOL - automatically tag a huge quantity of content

mcuk2000 · Nov 28, 2008

Let's say I have a huge DB of content and I need to write a script that extracts the keywords, or tags, from it.
I might have access to a DB of keywords to compare against, if the script needs it.

Does anyone know of a tool or a script, or can anyone point me in the right direction?

ozonew4m · Nov 28, 2008

You could parse the database content with a keyword density script and then use the most common words as your tags. You would need a huge list of stop words though, I think.
You could probably modify this:
Class: Magic HTML Parser (html parser, html parse, parse html) - PHP Classes
but you could probably find a better one by searching Google for a keyword density script.
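For anyone wondering what a keyword density pass looks like, here is a minimal sketch in Python (not the PHP class linked above). The function name `keyword_density`, the `top_n` parameter, and the tiny `STOP_WORDS` set are made up for illustration; a real stop-word list would be far longer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
              "for", "on", "with", "was", "him"}

def keyword_density(text, top_n=5):
    """Return up to top_n (word, density) pairs.

    Density here is the word's share of all non-stop words in the text.
    """
    # Lowercase, then pull out words, allowing one internal apostrophe ("don't").
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    words = [w for w in words if w not in STOP_WORDS]
    if not words:
        return []
    counts = Counter(words)
    total = len(words)
    return [(word, count / total) for word, count in counts.most_common(top_n)]

if __name__ == "__main__":
    sample = "The cat and the dog. The cat sat."
    print(keyword_density(sample, top_n=2))
```

The highest-density words would then be the candidate tags for that piece of content.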

If you have a list of keywords, you could use MySQL's full-text search to match articles to your keywords by relevance:
MySQL :: MySQL 5.0 Reference Manual :: 11.8 Full-Text Search Functions

hope that helps

Enigmabomb · Nov 28, 2008

Another vote for MySQL.

Enigmabomb · Nov 28, 2008

Or just parse it word by word in PHP using a regex. Put every word into an array, unique it, then count the occurrences of each word, and take the top x percent of the array.

dirty. but easy.

Like my ex.

ozonew4m · Nov 28, 2008

Enigmabomb said:
Put every word into an array, unique it, then count the occurrences of each word

If you unique the array first, wouldn't you just end up with a count of one for each word?
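To make the ordering issue concrete, here is a rough Python sketch (the thread talks about PHP, so treat this as a stand-in). It counts occurrences first and only then reduces to distinct words; the function name `top_percent_words` and the `percent` parameter are invented for the example, and "top x percent" is read here as a percentage of the distinct words.

```python
import math
import re
from collections import Counter

def top_percent_words(text, percent=10):
    """Count every word, then keep the most frequent `percent` of distinct words."""
    words = re.findall(r"[a-z]+", text.lower())
    # Count BEFORE removing duplicates; the Counter keys are the unique words.
    counts = Counter(words)
    if not counts:
        return []
    # Always keep at least one word when there is any text at all.
    keep = max(1, math.ceil(len(counts) * percent / 100))
    return counts.most_common(keep)

if __name__ == "__main__":
    text = "php tags php regex tags php"
    words = re.findall(r"[a-z]+", text.lower())
    # Deduplicating first throws the frequencies away: every count becomes 1.
    print(Counter(set(words)))
    # Counting first keeps the frequencies needed for ranking.
    print(top_percent_words(text, percent=50))
```

So the counting step already gives you the unique words for free; there is no need for a separate unique step before it.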

Houdas · Nov 28, 2008

Enigmabomb said:
dirty. but easy.

Like my ex.

LOL!
But he would need a really huge list of stop words, as already mentioned above. I tried this approach several times, but it's not really usable: you end up with a lot of stupid tags like "him", "was", "earlier", "after", etc.

audax · Nov 28, 2008

Combine the word counts with something like the WordNet DB to filter out the types of words you don't want, like adverbs, conjunctions, etc.

Dreamer · Nov 29, 2008

I've used Yahoo!'s Term Extraction Web Service to tag an entire forum. Worked really well for English content.

Note that there's a limit of 5,000 queries per IP address that you might have to work around.

mcuk2000 · Nov 30, 2008

Thanks everybody. Trying different approaches now, including those you suggested; will post feedback soon.
