oh wait,
Code:
domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
would make more sense for medway.
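with open(platform_write, "r+") as f:
    # Read every line into a set to drop duplicates (order is not preserved)
    unique = set(f.read().split("\n"))
    # Rewind, overwrite with the unique lines, then cut off any leftover tail
    f.seek(0)
    f.write("".join([line + "\n" for line in unique]))
    f.truncate()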
platform_write = "my path to file"
platform_out_file = open(platform_write, 'a')
platform_out_file.write(platform_clean)
platform_out_file.close()
Code:
import urlparse

urls = set(open('links.txt', 'r').read().replace('\r\n').split('\n'))
seen_domains = []
output = []
for url in urls:
    domain = urlparse.urlparse(url).netloc.lower()
    if domain not in seen_domains:
        seen_domains.append(domain)
        output.append(url)
print "Found %s unique domains" % len(output)
f = open('output.txt', 'w')
for i in output:
    f.write(i+"\n")
Will keep one url from each domain. (untested, but should work)
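For anyone on Python 3, here is a rough sketch of the same seen-domains loop. It assumes `urlparse` is imported from `urllib.parse` (its Python 3 location), swaps the `seen_domains` list for a set, and uses a made-up `sample` list instead of links.txt. The function name `one_url_per_domain` is just for illustration.

```python
from urllib.parse import urlparse  # Python 3 home of Python 2's urlparse module


def one_url_per_domain(urls):
    """Return the first URL seen for each distinct (lowercased) netloc."""
    seen_domains = set()  # set membership checks are fast, unlike a list
    output = []
    for url in urls:
        domain = urlparse(url).netloc.lower()
        # Skip entries with no netloc (e.g. blank lines) and repeat domains
        if domain and domain not in seen_domains:
            seen_domains.add(domain)
            output.append(url)
    return output


# Hypothetical sample data, not from the thread
sample = [
    "http://Example.com/a",
    "http://example.com/b",
    "http://other.org/page",
]
result = one_url_per_domain(sample)
print("Found %s unique domains" % len(result))
```

Because `sample` is a list here, the first URL for each domain wins. With a set of URLs, as in the post above, which one is kept is arbitrary.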
replace('\r\n', '')
import urlparse
urls = set(open('c:\\dropbox\\links.txt', 'r').read().replace('\r\n', '').split('\n'))
seen_domains = []
output = []
domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
if domains not in seen_domains:
seen_domains.append(domains)
output.append(url)
print output
domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
open('output.txt', 'w').write('\r\n'.join(domains.values()))
'\r\n' is there because you're using Windows apps to generate your URL lists, so in your case, yes.
'if domains not in seen_domains' = you are asking it if a dict is in a list, which doesn't make sense for this.
just do
Code:
domains = {urlparse.urlparse(url).netloc.lower(): url for url in urls}
open('output.txt', 'w').write('\r\n'.join(domains.values()))
domains.values() = your urls.
you don't need seen_domains or output then.
Ping me on Skype if you need any more help, medway.
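To see why the dict alone is enough, here is a small sketch in Python 3 (assuming `urllib.parse` in place of the Python 2 `urlparse` module, and a made-up `urls` list). Each domain is a key, so a later URL for the same domain just overwrites the earlier one.

```python
from urllib.parse import urlparse  # Python 3 location of urlparse

# Hypothetical sample URLs, not from the thread
urls = [
    "http://example.com/a",
    "http://EXAMPLE.com/b",
    "http://other.org/page",
]

# Keys are lowercased domains; a repeated key replaces the stored URL,
# so exactly one URL survives per domain.
domains = {urlparse(url).netloc.lower(): url for url in urls}

# domains.values() is the deduplicated URL list, ready to be joined and written
text = "\n".join(domains.values())
print(text)
```

Note the difference from the seen-domains loop: here the last URL for a domain is the one kept, not the first.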
Ok yea, got the '\r\n' bit, but when I ran it as is I got an error stating .replace needs two arguments. Originally it was just ('\r\n'), but it looks like those were supposed to be erased, so I changed it to ('\r\n', '') to pass an empty string as the second argument.
Ah ok, got it working using domains.values() as output now. I had just tried outputting domains before, so it wasn't deduped. Thanks!
What you want is .replace('\r\n', '\n').split('\n')
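A quick sketch of the difference, using a made-up `raw` string with mixed line endings. Replacing '\r\n' with an empty string glues neighbouring URLs together, while replacing it with '\n' keeps them on separate lines. `str.splitlines()` is shown as an alternative that handles both endings in one call.

```python
# Hypothetical file contents with mixed Windows and Unix line endings
raw = "http://a.com/1\r\nhttp://b.com/2\nhttp://a.com/1\r\n"

# Replacing '\r\n' with '' removes the line break entirely, merging two URLs
glued = raw.replace("\r\n", "").split("\n")

# Normalising '\r\n' to '\n' first keeps one URL per element
lines = raw.replace("\r\n", "\n").split("\n")

# Drop the empty string left by the trailing newline, then dedupe
urls = set(line for line in lines if line)

# Alternative: splitlines() splits on both '\r\n' and '\n'
alt = set(line for line in raw.splitlines() if line)

print(glued)
print(sorted(urls))
```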