Code:
    import os
    import random


    class RandomLines(object):
        """Iterate over the non-empty lines of a file in random order."""

        def __init__(self, input_file, cache_index=True):
            if isinstance(input_file, basestring):
                self.source_file = open(input_file, 'rb')
                filename = input_file
            else:
                self.source_file = input_file
                filename = input_file.name
            self.index = []
            if not os.path.isfile(filename + '.lineindex'):
                # Build an index of byte offsets, one per non-empty line.
                bytes_counter = 0
                for line in self.source_file:
                    bytes_counter += len(line)
                    if len(line.strip()):
                        self.index.append(bytes_counter - len(line))
                if cache_index:
                    open(filename + '.lineindex', 'w').write(
                        '\n'.join(str(i) for i in self.index))
            else:
                # Reuse the cached index instead of rescanning the file.
                self.index = [int(line.strip())
                              for line in open(filename + '.lineindex')]

        def __iter__(self):
            return self

        def next(self):
            while len(self.index):
                # Pick a remaining offset at random, seek to it, read that line.
                offset = self.index.pop(random.randrange(0, len(self.index)))
                self.source_file.seek(offset, 0)
                return self.source_file.readline().strip()
            raise StopIteration
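For reference, this is how I call it (the file name here is just an example):

    for line in RandomLines('corpus.txt'):
        print line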
It performs OK, but I suspect it could be a lot more efficient. Seeking the disk for every single line seems like a bad idea; perhaps some form of caching would help?
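The kind of caching I have in mind is something like this (untested sketch; the function name and the batch_size parameter are just placeholders): shuffle the index once up front instead of popping random elements, then read the lines back in small offset-sorted batches so the disk mostly seeks forward, re-shuffling each batch before yielding so the output order stays random.

    import random

    def random_lines_batched(source_file, index, batch_size=1024):
        """Untested sketch: yield non-empty lines in random order,
        reading them in offset-sorted batches of batch_size."""
        offsets = list(index)
        random.shuffle(offsets)                 # one shuffle up front, no O(n) pops
        for start in xrange(0, len(offsets), batch_size):
            batch = sorted(offsets[start:start + batch_size])
            lines = []
            for offset in batch:                # seeks within a batch move forward
                source_file.seek(offset, 0)
                lines.append(source_file.readline().strip())
            random.shuffle(lines)               # keep the order random within the batch
            for line in lines:
                yield line

Would something along those lines actually buy much, or is the OS read-ahead already doing the work for me?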
inb4 stupid language war comments.