Huge data set, better option than a database?

danny

So let's say I have one table that has 50-100 million entries. Just a text value and a numerical value, no primary key or index.

I'm inserting the results of a query into a new table: a GROUP BY with a COUNT to aggregate and count up all the rows that have the same text value.

Is there some less resource-heavy way to do this, or should I just let it chug away? I'll probably never have to run this again unless I decide to use more data. It's been half an hour so far, and I have no idea how long this will be going for.
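Roughly speaking, it's this kind of statement (table and column names made up), shown here fired from a little Python script with MySQLdb:

Code:
import MySQLdb  # classic Python MySQL driver; connection details below are made up

conn = MySQLdb.connect(host="localhost", user="user", passwd="password", db="db_name")
cur = conn.cursor()

# aggregate the big two-column table into a new (text value, row count) table
cur.execute("""
    INSERT INTO value_counts (text_value, cnt)
    SELECT text_value, COUNT(*)
    FROM big_table
    GROUP BY text_value
""")
conn.commit()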
 


Post that here
 
assuming you dumped it to a text file (or read it in with a script):

Code:
counts = {}

# tally how many times each text value appears;
# the text value is the first comma-separated field on each line
for line in open('file.txt'):
    value = line.strip().split(',')[0]
    if value not in counts:
        counts[value] = 0
    counts[value] += 1

gives you a hashmap / dictionary (this is in python).

easy to then dump it back into the database.

mysql will be doing it all at once, which will hurt.
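For the "dump it back" step, a rough sketch with MySQLdb (the target table and column names are just placeholders):

Code:
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="user", passwd="password", db="db_name")
cur = conn.cursor()

rows = list(counts.items())  # counts is the dict built by the loop above

# insert in batches so one statement never has to carry millions of rows
batch_size = 10000
for i in range(0, len(rows), batch_size):
    cur.executemany(
        "INSERT INTO value_counts (text_value, cnt) VALUES (%s, %s)",
        rows[i:i + batch_size],
    )
conn.commit()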
 
Mattseh's spot on. The fastest way to get 100M rows out of the database at once is mysqldump:
Code:
mysqldump -uuser -ppassword -T/output_path/ --fields-terminated-by=, db_name table_name

Then you read the whole file in and sum up the uniques. His python script is a good way to do this, but quicker and dirtier would be:
Code:
cat /output_path/table_name.txt | cut -d, -f1 | sort | uniq -c  # sort first so uniq -c counts non-adjacent duplicates too
 
MongoDB makes operations like this easy as well and is simple to shard if you need to scale.
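For example, a quick sketch with pymongo (collection and field names are made up, and it assumes a server recent enough to have the aggregation pipeline):

Code:
from pymongo import MongoClient

client = MongoClient("localhost", 27017)
db = client["mydb"]

# group by the text field, count occurrences, and write the result to a new collection
db.entries.aggregate([
    {"$group": {"_id": "$text", "count": {"$sum": 1}}},
    {"$out": "text_counts"},
], allowDiskUse=True)  # let large groupings spill to disk instead of filling RAM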
 
This (the Python dict-counting approach above) is basically what I ended up doing, but it takes a shitload of memory and a long time. Still, way, way better than any DBMS.