SCRIPT/TOOL - automatically tag a huge quantity of content

Status
Not open for further replies.

mcuk2000

New member
May 26, 2008
London, UK - NYC, US
Let's say I have a huge DB of content and I need to write a script that extracts the keywords, or tags, from it.
I might have access to a DB of keywords for the script to compare against, if needed.

Does anyone know of a tool or a script, or can anyone point me in the right direction?
 


You could parse the database content with a keyword density script and then use the most common words as your tags. You would need a huge list of stop words though, I think.
You could probably modify this:
Class: Magic HTML Parser (html parser, html parse, parse html) - PHP Classes
but you could probably find a better one by searching Google for a keyword density script.
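
The keyword density idea above can be sketched roughly as follows. This is a minimal illustration in Python rather than PHP, not the linked class; the names `STOP_WORDS`, `keyword_density` and `suggest_tags` are made up for the example, and the stop-word list is deliberately tiny.

```python
import re
from collections import Counter

# Deliberately tiny stop-word list for illustration; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on", "for", "with", "was"}


def keyword_density(text, stop_words=STOP_WORDS):
    """Return {word: share of the non-stop-word tokens} for one piece of content."""
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in stop_words]
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def suggest_tags(text, max_tags=5):
    """Return the densest words as candidate tags."""
    density = keyword_density(text)
    # Sort by density (highest first), then alphabetically so ties are stable.
    ranked = sorted(density.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_tags]]
```

Calling `suggest_tags(row_text)` once per DB row would give a first rough tag list; how good it is depends almost entirely on the stop-word list.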


If you have a list of keywords, you could use MySQL's full-text search to match articles to your keywords by relevance:
MySQL :: MySQL 5.0 Reference Manual :: 11.8 Full-Text Search Functions
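
A rough sketch of what such a full-text query could look like, built as a parameterised string in Python. The table name `articles`, the `id` column and the helper names are assumptions for illustration only; the `MATCH ... AGAINST` syntax is MySQL's, and it needs a FULLTEXT index on the same columns.

```python
def build_fulltext_query(columns, limit=10):
    """Build a parameterised MySQL full-text query (hypothetical `articles` table)."""
    # Column names are interpolated into the SQL, so only accept plain identifiers.
    if not columns or not all(c.isidentifier() for c in columns):
        raise ValueError("columns must be plain identifiers")
    cols = ", ".join(columns)
    match = f"MATCH({cols}) AGAINST (%s)"
    return (
        f"SELECT id, {match} AS relevance "
        f"FROM articles WHERE {match} "
        f"ORDER BY relevance DESC LIMIT {int(limit)}"
    )


def query_params(keyword):
    """The keyword is bound twice: once in the SELECT list, once in WHERE."""
    return (keyword, keyword)
```

With a DB-API driver this would be run once per keyword, along the lines of `cursor.execute(build_fulltext_query(["title", "body"]), query_params("php"))`, and the top-ranked articles would get that keyword as a tag.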


Hope that helps.
 
Or just parse word by word in PHP using regex. Put every word into an array, count the occurrences of each unique word, then take the top x percent of the array.

dirty. but easy.

Like my ex.
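
The word-by-word approach above, sketched in Python instead of PHP. The function name `top_percent_words` and the `percent` parameter are made up for this example, and no stop-word filtering is done here.

```python
import math
import re
from collections import Counter


def top_percent_words(text, percent=10):
    """Split text into words with a regex, count them, keep the top `percent` of unique words."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)  # unique words mapped to their occurrence counts
    if not counts:
        return []
    # Always keep at least one word, and round up so small inputs still yield a tag.
    keep = max(1, math.ceil(len(counts) * percent / 100))
    # Most frequent first; ties broken alphabetically for a stable result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:keep]]
```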
 

LOL!
But he would need a really huge list of stop words, as already mentioned above. I tried this approach several times, but it's not really usable: you end up with a lot of useless tags like "him", "was", "earlier", "after", etc.
 
Combine the word counts with something like the WordNet DB to filter out the types of words you don't want, like adverbs, conjunctions, etc.
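
A minimal sketch of that filtering step in Python. The small `POS_TABLE` dictionary is a hand-written stand-in for a real WordNet lookup, and `filter_by_pos` is a hypothetical helper name. WordNet itself only lists nouns, verbs, adjectives and adverbs, so conjunctions and pronouns would simply be missing from the lookup, which is treated here the same as "unwanted".

```python
from collections import Counter

# Hypothetical stand-in for a WordNet lookup: word -> set of parts of speech.
# In practice these entries would come from the WordNet database.
POS_TABLE = {
    "server": {"noun"},
    "database": {"noun"},
    "backup": {"noun"},
    "quickly": {"adverb"},
    "earlier": {"adverb", "adjective"},
}


def filter_by_pos(counts, pos_table=POS_TABLE, keep=frozenset({"noun"})):
    """Keep only words the table lists under at least one wanted part of speech."""
    return Counter({
        word: n
        for word, n in counts.items()
        # Words missing from the table (pronouns, conjunctions, ...) are dropped too.
        if pos_table.get(word, set()) & keep
    })
```

Feeding the word counts from the earlier counting step through `filter_by_pos` before taking the top entries should remove most of the "him" / "earlier" style tags without a hand-maintained stop-word list.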
 