Huge data set, better option than a database?

danny

So let's say I have one table that has 50-100 million entries. Just a text value and a numerical value, no primary key or index.

I'm inserting the results of a query into a new table: a GROUP BY with a COUNT to aggregate and count up all the rows that have the same text value.

Is there some less resource-heavy way to do this, or should I just let it chug away? I'll probably never have to run this again unless I decide to use more data. It's been half an hour so far, and I have no idea how long this will be going for.
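Roughly speaking, it's this kind of statement (table and column names made up), shown here fired from a little Python script with MySQLdb:

Code:
import MySQLdb  # classic Python MySQL driver; connection details below are made up

conn = MySQLdb.connect(host="localhost", user="user", passwd="password", db="db_name")
cur = conn.cursor()

# aggregate the big two-column table into a new (text value, row count) table
cur.execute("""
    INSERT INTO value_counts (text_value, cnt)
    SELECT text_value, COUNT(*)
    FROM big_table
    GROUP BY text_value
""")
conn.commit()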
 


Post that here
 
assuming you dumped it to a text file (or read it in with a script):

Code:
counts = {}

# tally how many times each text value appears;
# the text value is the first comma-separated field on each line
for line in open('file.txt'):
    value = line.strip().split(',')[0]
    if value not in counts:
        counts[value] = 0
    counts[value] += 1

gives you a hashmap / dictionary (this is in python).

easy to then dump it back into the database.

mysql will be doing it all at once, which will hurt.
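For the "dump it back" step, a rough sketch with MySQLdb (the target table and column names are just placeholders):

Code:
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="password", db="db_name")
cur = conn.cursor()

rows = list(counts.items())  # counts is the dict built by the loop above

# insert in batches so one statement never has to carry millions of rows
batch_size = 10000
for i in range(0, len(rows), batch_size):
    cur.executemany(
        "INSERT INTO value_counts (text_value, cnt) VALUES (%s, %s)",
        rows[i:i + batch_size],
    )
conn.commit()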
 
Mattseh's spot on. The fastest way to get 100M rows out of the database at once is mysqldump:
Code:
mysqldump -uuser -ppassword -T/output_path/ --fields-terminated-by=, db_name table_name

Then you read the whole file in and sum up the uniques. His python script is a good way to do this, but quicker and dirtier would be:
Code:
cat /output_path/table_name.txt | cut -d, -f1 | sort | uniq -c  # sort first so uniq -c counts non-adjacent duplicates too
 
MongoDB makes operations like this easy as well and is simple to shard if you need to scale.
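For example, a quick sketch with pymongo (collection and field names are made up, and it assumes a server recent enough to have the aggregation pipeline):

Code:
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["mydb"]

# group by the text field, count occurrences, and write the result to a new collection
db.entries.aggregate([
    {"$group": {"_id": "$text", "count": {"$sum": 1}}},
    {"$out": "text_counts"},
], allowDiskUse=True)  # let large groupings spill to disk instead of filling RAM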
 
This (the Python dict-counting approach above) is basically what I ended up doing, but it takes a shitload of memory and a long time. Still, way, way better than any DBMS.