Help removing keywords from large lists

Moon54

New member
Jul 17, 2006
Hey all, I've got a large text list of keywords (100MB) that contains each keyword in broad, phrase and exact match. The list looks like this:

keyword 1
"keyword 1"
[keyword 1]
keyword 2
"keyword 2"
[keyword 2]
.......etc

Does anyone know of a way to remove the duplicate keywords, so that I'm left with just the broad-match ones? E.g.

keyword 1
keyword 2
.......

Because it's a large text file, I'm hoping to do it in Notepad++.
I'm guessing there is a clever regular expression that should be able to do this.

Anyone got any ideas on how to do this?
thx
 


Ummm, not sure what you're doing, BUT I don't think you need/want that many keywords in one group. Sort them out and make your ads a little more targeted.

But what do I know... export to Excel, or download AdWords Editor. Both will make it a little easier.
 
Paste into Excel or a text editor that can sort the list. Then you can easily select all rows starting with " or [ and hit delete.
 
thx guys,
yeah, I thought about doing that but the keyword list does need to stay in order.

The logic of what I am trying to do would be something like "find a keyword with "" or [], remove whatever is between the "" or [], then remove the ""/[] and remove the blank line, then repeat"

Not to worry, this probably could be done with a php script too.
I'll figure it out, cheers.
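
A minimal sketch of that logic in Python (the thread only mentions PHP at this point, so the language and the function name `broad_match_only` are assumptions for illustration). Rather than blanking the quoted or bracketed text and then deleting the empty line, it simply skips those lines, which leaves the remaining keywords in their original order:

```python
def broad_match_only(lines):
    """Yield only the lines that do not start with a quote or an opening bracket."""
    for line in lines:
        # str.startswith accepts a tuple of prefixes to test against.
        if not line.startswith(('"', '[')):
            yield line


# Made-up sample in the same layout as the list in the first post.
sample = [
    'keyword 1', '"keyword 1"', '[keyword 1]',
    'keyword 2', '"keyword 2"', '[keyword 2]',
]
filtered = list(broad_match_only(sample))
print(filtered)
```

Because it is a generator, it can be fed an open file object just as well as a list.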
 
thx guys,
yeah, I thought about doing that but the keyword list does need to stay in order.

Open it in a spreadsheet. In the column next to the keywords add 1, 2, 3, etc. (you can drag or double-click to populate all rows). Sort the list (together with the added column) alphabetically, remove the lines starting with " or [, then sort by the numerical column and your list will be back in the correct order.

The only downside is that your list can only be approx. 65k lines long (older Excel versions stop at 65,536 rows per sheet).
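
The same number, sort, delete, re-sort idea can be sketched outside a spreadsheet. The Python below is only an illustration with made-up keywords; the variable names are not from the thread:

```python
keywords = ['zebra', '"zebra"', '[zebra]', 'apple', '"apple"', '[apple]']

# Step 1: add the helper column (1, 2, 3, ...) next to each keyword.
rows = [(number, keyword) for number, keyword in enumerate(keywords, start=1)]

# Step 2: sort alphabetically by keyword, so the " and [ lines group together.
rows.sort(key=lambda row: row[1])

# Step 3: remove the rows starting with " or [.
rows = [row for row in rows if not row[1].startswith(('"', '['))]

# Step 4: sort by the helper column to restore the original order.
rows.sort(key=lambda row: row[0])

restored = [keyword for _, keyword in rows]
print(restored)
```

In code the two sorts are not strictly needed, since step 3 alone already preserves the order. They are kept here only to mirror the spreadsheet steps.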
 
Ahh, nice thinking j0hn, that's a cool way to do it, too. :)

That'll work just fine, 'cause I can chop the big file into manageable chunks and work from there. Nice one.
+rep 2 u
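
One possible way to do the chopping, sketched in Python. The helper name `chunks` and the tiny chunk size are assumptions for the demonstration; for a real run the size would be somewhere below the spreadsheet's row limit:

```python
from itertools import islice


def chunks(lines, size):
    """Yield successive lists of at most `size` lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Tiny demonstration with a chunk size of 3 instead of tens of thousands.
parts = list(chunks(['a', 'b', 'c', 'd', 'e', 'f', 'g'], 3))
print(parts)
```

Since each keyword appears on three consecutive lines, a chunk size that is a multiple of 3 (for example 60,000) keeps a keyword's three variants in the same chunk.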
 
I'm guessing there is a clever regular expression that should be able to do this
yep, but "remove the whole line" isn't something that most gui text editors support well, even if they've got regex search-and-replace.

From a Mac or Linux command line (both should have egrep; if you're on Windows, install Cygwin, or mail me the list and I'll run this command for you, or upload it to a server you have SSH access to, etc.):
Code:
egrep -v '^("|\[)' file.txt > newfile.txt
Translation: find all lines excluding (-v) those starting (^) with a quote or an opening bracket ( ("|\[) ), and write the result to newfile.txt.

I fucking love using the right tool for a job.
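
For anyone without egrep to hand, the same pattern can be tried out in Python's `re` module. This is only an illustrative sketch and the sample lines are made up. Note that a quote in the middle of a line does not match, because of the ^ anchor:

```python
import re

# Same pattern as in the egrep command: start of line, then a quote or an opening bracket.
pattern = re.compile(r'^("|\[)')

lines = ['keyword 1', '"keyword 1"', '[keyword 1]', 'say "hello"']

# Equivalent of -v: keep the lines where the pattern does NOT match.
kept = [line for line in lines if not pattern.search(line)]
print(kept)
```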
 
Thanks uplinked,

That code makes me want to dust off my Ubuntu CDs and install it again on my second PC!

That code is an elegant way to do it, I love it.
I just downloaded Cygwin and I'm gonna try it. :)
 
On Windows:

If you have Perl installed and in your PATH, you can open a command prompt and cd to the directory where the file is located (or install "Open Command Window Here" from Microsoft PowerToys for Windows XP, which lets you right-click the folder and open a command prompt that's already pointing to that directory).

Then run the following one-liner.

Code:
perl -ni.bak -e "print if !/^(?:\[|\")/" file.txt

That will strip the exact and phrase match lines from the original file and create a backup of the original with a '.bak' extension (file.txt.bak) in the same directory.

If you have PHP installed and in your PATH, you can use the following.

Code:
php -r "$nf = fopen('newfile.txt','w'); $f = file('file.txt'); foreach($f as $l) {if (!preg_match('/^[\[\"]/',$l)) fwrite($nf,$l);};"

That will create a file called "newfile.txt" (or whatever you want to name it) and write the filtered output to it, leaving the original file intact.

Remember to change 'file.txt' in the above examples to your keyword file name.
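
One thing to keep in mind with a 100MB list: PHP's file() reads the whole file into an array before the loop starts. As a hedged alternative, here is a Python sketch that reads one line at a time. The function name `filter_file` is made up for illustration, and the UTF-8 encoding is an assumption about the keyword file:

```python
def filter_file(src_path, dst_path):
    """Copy src to dst, skipping lines that start with a quote or an opening bracket.

    Returns the number of lines written.
    """
    written = 0
    # Iterating over the file object reads one line at a time,
    # so the whole list never has to fit in memory at once.
    # The UTF-8 encoding is an assumption; adjust it to match the real file.
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            if not line.startswith(('"', '[')):
                dst.write(line)
                written += 1
    return written
```

It would be called as filter_file('file.txt', 'newfile.txt'), and like the PHP version it leaves the original file intact.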
 
lols

Hit Ctrl+F, click on Replace, put
[*] in the find box, leave replace blank, and click Replace All.

You may have to mess around to do the same with quotes, maybe /"*/" or \"*\" or "*".
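
One caveat worth checking before running that on the real file: in regular-expression syntax, [*] is a character class that matches a literal asterisk, not "a bracket, anything, a bracket". Blanking the matched text would also leave empty lines behind. The sketch below uses Python's `re` module to show the difference. It assumes Python's regex dialect, and how a given editor's search box interprets these patterns may differ:

```python
import re

text = 'keyword 1\n"keyword 1"\n[keyword 1]\nkeyword 2\n'

# In regex syntax, [*] is a character class containing only '*',
# so it finds nothing in this text.
asterisk_hits = re.findall(r'[*]', text)

# Matching a whole line that starts with " or [, including its newline,
# removes the line without leaving a blank one behind.
cleaned = re.sub(r'(?m)^["\[].*\n?', '', text)

print(asterisk_hits)
print(repr(cleaned))
```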