extracting keywords from text

I've searched Google and didn't find a PHP script that can extract keywords from text.

Before I go mad writing a rudimentary one myself, do you happen to have/know of a PHP script/class that does this?

It should not use external APIs or anything like that, as I'll need to run it against > 100,000 texts.
 


Pretty funny there.

I want to know how many people in the world fart in the shower. But I have to find the answer in my bathroom without any connection/information from outside the bathroom.
 
Basically, to elaborate the point: it's not the words themselves that determine whether they are keywords. It's how often they're used together that determines "keywords". It's also the number of times the "keyword" appears on the entire indexed internet, divided by the total number of sites on the internet, that determines whether it's "cat", "food", or "cat food".

Then relevancy comes into play: how often was "cat food" used with "pet supplies"? Thus "cat food" comes up because, across the entire internet, 100,000 sites used "cat food" on pages about "pet supplies".

So you see how your question is fundamentally flawed?
 
You're right, but I am just looking for the words/multi-word phrases that occur most often in the text I pass to such a function, excluding short/common words (for, the, as, in, and...), not considering the rest of the world outside my input.

It seems I should set a max number of words to consider as a single keyword, then count the occurrences for each keyword length and pick, say, the top 4 for every length.

This is ballpark science, nothing to automate PPC with! Sure, it sounds easier to think about than to code, but it's something I don't need immediately, so I thought I'd ask if someone had something to play with.

Your question about farting in the shower is intriguing; it would actually be interesting to know the numbers.
 
Hmm
- strip the html
- strip your stop words (the, or, etc). also strip punctuation
- explode the text at the space character to form an array
- loop through each word and increment a count for it (another associative array)
- repeat, exploding at every other space, then every third space, etc

not entirely sure but that's my rough guess if you have to do it yourself
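
Something like this, maybe (just a rough sketch of the idea above; the stop word list is obviously incomplete):

Code:
// strip html, strip punctuation and stop words, then count single-word frequencies
$stopWords = array('the', 'a', 'an', 'and', 'or', 'for', 'as', 'in', 'of', 'to'); // use a fuller list

$text  = strtolower(strip_tags($text));
$text  = preg_replace('/[^a-z0-9\s]/', ' ', $text);           // strip punctuation
$words = preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY); // explode at whitespace

$counts = array();
foreach ($words as $word) {
    if (in_array($word, $stopWords)) {
        continue; // skip stop words
    }
    if (!isset($counts[$word])) {
        $counts[$word] = 0;
    }
    $counts[$word]++;
}
arsort($counts); // most frequent words first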
 
Not sure if you found my post to be of any use, but I realized an error in it, so I figured I should update. You wouldn't want to break on every space and then on every other space, etc. My thinking was that this way you get single-word keywords and then two-word keywords, but it's done incorrectly.
Assume text of "word word2 word3 word4": you would break it into "word word2" and "word3 word4" but never "word2 word3".
So you would actually want to break at every space into an array, and then for two-word keyphrases loop through that array, combining the word at the current index with the next one ($array[$i + 1]); that gives you every adjacent pair, including "word2 word3".
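
Rough sketch of what I mean (untested; assumes $words is the array you get from exploding the text at spaces):

Code:
// two-word keyphrases from adjacent pairs: $words[$i] plus $words[$i + 1]
$phraseCounts = array();
for ($i = 0; $i < count($words) - 1; $i++) {
    $phrase = $words[$i] . ' ' . $words[$i + 1];
    if (!isset($phraseCounts[$phrase])) {
        $phraseCounts[$phrase] = 0;
    }
    $phraseCounts[$phrase]++;
}
arsort($phraseCounts); // most frequent pairs first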

just in case this helps ;)
 
this is entering the realm of NLP but if you want a fairly simple and effective approach...

establish max keyword length (i.e. num words).
normalize your text, i.e. convert all to lower case, stem the words (turn 'cats' into 'cat' so they compare correctly; see the rough stem() sketch below).
remove stopwords (google it to get a list of common ones).
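
PHP doesn't ship with a stemmer, so for the stemming step you'd normally drop in a Porter stemmer class; here is a very crude suffix-stripping sketch just to show the idea:

Code:
// very crude stemmer sketch: strips a few common English suffixes
// a proper Porter stemmer implementation is the better choice for real use
function stem($word) {
    $suffixes = array('ies' => 'y', 'es' => '', 's' => '', 'ing' => '', 'ed' => '');
    foreach ($suffixes as $suffix => $replacement) {
        if (strlen($word) > strlen($suffix) + 2 && substr($word, -strlen($suffix)) === $suffix) {
            return substr($word, 0, -strlen($suffix)) . $replacement;
        }
    }
    return $word;
}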

create a separate array for 1, 2, 3 etc word combos, use the actual word combo as the array key and pass through the text creating a count for every combo so you get a frequency count. Should be something pretty easy like

Code:
$wordCombo = getWordCombo($wordPosition); // you'll have to write that function yourself, of course
if (!isset($combos[$wordCombo])) {
    $combos[$wordCombo] = 1;
} else {
    $combos[$wordCombo]++;
}

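For illustration, getWordCombo() might look something like this (a sketch only; the extra arguments are just so it's self-contained, where $words is your normalized word array and $comboLength is how many words to take):

Code:
// sketch: join $comboLength consecutive words starting at $wordPosition into one combo string
function getWordCombo($words, $wordPosition, $comboLength) {
    if ($wordPosition + $comboLength > count($words)) {
        return null; // ran off the end of the text
    }
    return implode(' ', array_slice($words, $wordPosition, $comboLength));
}
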
average the frequency count for each length of word combo, then divide each individual combo's frequency by the average frequency to give you a relative frequency.

last step, which is very useful for establishing the relative importance of the combination vs. its constituent words... for each word in each word combo, divide the number of times it appears in that combo by the total number of times it appears as a single word. Do that for each word in the combo, multiply the results by the relative frequency you calculated earlier, then multiply by the length of the word combo to cancel out the effect of the multiple divisions (this last step should make your 3-word combo scores comparable to your 2-word and 4-word scores), with a slight weighting in favor of longer word combos...

sort by score...
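
In code, the scoring might look roughly like this (a sketch; $combosByLength[$n] is assumed to be the frequency array for $n-word combos from the earlier step, and $singleCounts the single-word frequencies):

Code:
// score each combo: relative frequency * product of (combo freq / single-word freq) * combo length
$scores = array();
foreach ($combosByLength as $n => $combos) {
    if (count($combos) == 0) {
        continue;
    }
    $avgFreq = array_sum($combos) / count($combos); // average frequency for this combo length

    foreach ($combos as $combo => $freq) {
        $score = $freq / $avgFreq; // relative frequency
        foreach (explode(' ', $combo) as $word) {
            $score *= $freq / $singleCounts[$word]; // combo count vs the word's standalone count
        }
        $score *= $n; // weight by combo length to offset the repeated divisions
        $scores[$combo] = $score;
    }
}
arsort($scores); // highest scoring keyphrases first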

this might be helpful in getting your head around the important concepts here:
LingPipe: Significant Phrases Tutorial

It sounds a lot trickier than it is... just break up your code into logical chunks and build one at a time, making sure you've got the right output for each stage before you move on to the next...
 
I started building something like this a while back and never got around to finishing it.

Keyword Generator

It just looks at how often single words are used, not phrases. It might not be what you're looking for, but take a look at it, and if you want the source, PM me and I'll send it to you.