Improving the efficiency of reading a large CSV file

Asked: 2015-03-09 20:30:09

Tags: python

I'm using RAKE (the Rapid Automatic Keyword Extraction algorithm) to generate keywords. I have about 53 million records, roughly 4.6 GB, and I'd like to know the best way to go about this.

I have RAKE wrapped up nicely in a class. I have a 4.5 GB file containing 53 million records. Here are a few approaches.
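
For readability, here is a hypothetical sketch of the two helpers the snippets below rely on: rake.run(line) returns a keyword string for one record, and write(text) appends to an output file. The bodies are stand-ins for illustration only, not the actual RAKE wrapper:

# hypothetical stand-ins so the snippets below read as complete code;
# the real rake.run() would execute the RAKE algorithm on one record
class FakeRake:
    def run(self, line):
        return ",".join(line.strip().split(",")[:3])  # pretend the first fields are keywords

rake = FakeRake()
out_file = open("keywords.txt", "a")

def write(text):
    out_file.write(text)  # stand-in for however the keywords are persisted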

Approach #1:

with open("~inputfile.csv") as fd:
   for line in fd:
      keywords = rake.run(line)
      write(keywords)

This is the basic brute-force way. Assuming the file writes take time, calling write 53 million times will be expensive. So I used the following approach, writing 100K lines to the file at a time.

Approach #2:

with open("~inputfile.csv") as fd:
temp_string = ''
counter = 0
   for line in fd:
      keywords = rake.run(line)
      string = string + keywords + '\n'
      counter += 1
      if counter == 100000:
           write(string)
           string = ''

To my surprise, Approach #2 took more time than Approach #1. I don't get it! How is that possible? Can you also suggest a better approach?

Approach #3 (thanks to cefstat)

with open("~inputfile.csv") as fd:
  strings = []
  counter = 0
  for line in fd:
    strings.append(rake.run(line))
    counter += 1
    if counter == 100000:
      write("\n".join(strings))
      write("\n")
      strings = []

Runs faster than Approaches #1 and #2.

Thanks in advance!

2 answers:

Answer 0 (score: 3):

As mentioned in the comments, Python already buffers writes to files, so implementing your own buffering in Python (as opposed to in C, where it already exists) will only make it slower. You can tune the buffer size with an argument to the open call.
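
For illustration, a minimal sketch of tuning that buffer; the 1 MB size and file names are arbitrary assumptions, and rake is the OP's wrapper object:

# a minimal sketch: let Python's built-in buffering batch the writes
# the 1 MB buffer size and file names are arbitrary assumptions
BUF_SIZE = 1024 * 1024

with open("inputfile.csv") as fd, open("keywords.txt", "w", buffering=BUF_SIZE) as out:
    for line in fd:
        out.write(rake.run(line) + "\n")  # flushed to disk in large chunks, not per call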

Another approach is to read the file in chunks. The basic algorithm is this:

  1. Iterate over the file using file.seek(x), where x = current position + desired chunk size
  2. As you iterate, record the start and end byte positions of each chunk
  3. In a worker process (using multiprocessing.Pool()), read in a chunk using its start and end byte positions
  4. Each process writes its own keyword file

  5. Reconcile the separate files (a minimal sketch of this step follows the list). You have a few options:

    • Read the keyword files back into memory and into a single list
    • If you're on *nix, combine the keyword files with the "cat" command
    • If you're on Windows, keep a list of the keyword file paths (rather than a single file path) and iterate over them as needed
  6. There are many blog posts and recipes on reading large files in parallel:

    https://stackoverflow.com/a/8717312/2615940
    http://aamirhussain.com/2013/10/02/parsing-large-csv-files-in-python/
    http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/
    http://effbot.org/zone/wide-finder.htm
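
    A minimal sketch of step 5, assuming each worker wrote its keywords to a file matching a hypothetical keywords_*.txt pattern:

    import glob

    # sketch: merge the per-worker keyword files into one combined file
    # the "keywords_*.txt" pattern and output name are assumed for illustration
    with open("all_keywords.txt", "w") as combined:
        for part in sorted(glob.glob("keywords_*.txt")):
            with open(part) as fragment:
                combined.write(fragment.read())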

    Side note: I've tried to do the same thing and got the same results. Outsourcing the file writes to a separate thread didn't help either (at least it didn't when I tried it).

    Here's a snippet that demonstrates the algorithm:

    import functools
    import multiprocessing
    
    BYTES_PER_MB = 1048576
    
    # stand-in for whatever processing you need to do on each line
    # for demonstration, we'll just grab the first character of every non-empty line
    def line_processor(line):
        try:
            return line[0]
        except IndexError:
            return None
    
    # here's your worker function that executes in a worker process
    def parser(file_name, start, end):
    
        with open(file_name) as infile:
    
            # get to proper starting position
            infile.seek(start)
    
            # use read() to force exactly the number of bytes we want
            lines = infile.read(end - start).split("\n")
    
        return [line_processor(line) for line in lines]
    
    # this function splits the file into chunks and returns the start and end byte
    # positions of each chunk
    def chunk_file(file_name):
    
        chunk_start = 0
        chunk_size = 512 * BYTES_PER_MB # 512 MB chunk size
    
        with open(file_name) as infile:
    
            # we can't use the 'for line in infile' construct because infile.tell()
            # is not accurate during that kind of iteration
    
            while True:
                # move chunk end to the end of this chunk
                chunk_end = chunk_start + chunk_size
                infile.seek(chunk_end)
    
                # reading a line will advance the FP to the end of the line so that
                # chunks don't break lines
                line = infile.readline()
    
                # check to see if we've read past the end of the file
                if line == '':
                    yield (chunk_start, chunk_end)
                    break
    
                # adjust chunk end to ensure it didn't break a line
                chunk_end = infile.tell()
    
                yield (chunk_start, chunk_end)
    
                # move starting point to the beginning of the new chunk
                chunk_start = chunk_end
    
        return
    
    if __name__ == "__main__":
    
        pool = multiprocessing.Pool()
    
        keywords = []
    
        file_name = # enter your file name here
    
        # bind the file name argument to the parsing function so we don't have to
        # explicitly pass it every time
        new_parser = functools.partial(parser, file_name)
    
        # chunk out the file and launch the subprocesses in one step
        for keyword_list in pool.starmap(new_parser, chunk_file(file_name)):
    
            # as each list is available, extend the keyword list with the new one
            # there are definitely faster ways to do this - have a look at 
            # itertools.chain() for other ways to iterate over or combine your
            # keyword lists
            keywords.extend(keyword_list) 
    
        # now do whatever you need to do with your list of keywords
    

Answer 1 (score: 3):

Repeatedly appending to a string in Python is very slow (as jedwards mentioned). You can try the following standard alternative. It will almost certainly be faster than #2, and in my limited testing it was about 30% faster than Approach #1 (though perhaps still not fast enough for your needs):

with open("~inputfile.csv") as fd:
  strings = []
  counter = 0
  for line in fd:
    strings.append(rake.run(line))
    counter += 1
    if counter == 100000:
      write("\n".join(strings))
      write("\n")
      strings = []
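
As a rough illustration of the cost difference, a small timeit sketch comparing repeated concatenation with a single join (the list size is arbitrary and timings will vary by machine):

import timeit

PARTS = ["keyword"] * 100000  # arbitrary size, purely for illustration

def build_by_concat():
    s = ""
    for p in PARTS:
        s = s + p + "\n"  # grows the string a piece at a time
    return s

def build_by_join():
    return "\n".join(PARTS) + "\n"  # builds the result in a single pass

print("concat:", timeit.timeit(build_by_concat, number=10))
print("join:  ", timeit.timeit(build_by_join, number=10))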