Question

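I am trying to process a huge CSV file (over 20 GB), but the process gets killed when the whole file is read into memory. To avoid this, I am trying to read the second column line by line.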

For example, the second column contains data such as

xxx, computer is good

xxx, build algorithm

import collections

wordcount = collections.Counter()

with open('desc.csv', 'rb') as infile:
    for line in infile:
        wordcount.update(line.split())

My code counts words across all columns. How can I read only the second column without using the CSV reader?

Answer 1

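As far as I know, calling csv.reader(infile) opens and reads the whole file... and that is where the problem lies.

You can read the file line by line and parse it manually: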

X = []

with open('desc.csv', 'r') as infile:
    for line in infile:
        # Split on comma first
        cols = [x.strip() for x in line.split(',')]

        # Skip blank or short lines that have no second column
        if len(cols) < 2:
            continue

        # Grab 2nd "column"
        col2 = cols[1]

        # Split on whitespace; split() with no argument drops empty strings
        words = col2.split()
        for word in words:
            if word not in X:
                X.append(word)

for w in X:
    print(w)

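This keeps only a small chunk of the file (one line) in memory at any given time. However, you may still run into the variable X growing quite large, in which case the program will fail on memory limits. It depends on how many unique words end up in the "vocabulary" list.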
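A possible variation on the loop above, as a rough sketch rather than a drop-in replacement: accumulating into a collections.Counter (as the question already does) avoids the linear `word not in X` scan and keeps the counts. The function name `count_second_column_words` and the sample rows are made up for illustration. Note that a plain `split(',')` is only safe if no field contains a quoted comma, and memory still grows with the number of unique words.

```python
import collections


def count_second_column_words(lines):
    """Count whitespace-separated words in the second comma-separated field.

    `lines` can be any iterable of strings, e.g. an open text file.
    Hypothetical helper; assumes fields never contain quoted commas.
    """
    wordcount = collections.Counter()
    for line in lines:
        # maxsplit=2 stops after the second comma, so long rows are not
        # broken into more pieces than needed.
        fields = line.rstrip("\r\n").split(",", 2)
        if len(fields) < 2:
            continue  # skip blank or single-column lines
        wordcount.update(fields[1].split())
    return wordcount


# In-memory sample; a real run would pass the open file object instead,
# e.g. `with open('desc.csv') as infile: count_second_column_words(infile)`.
sample = ["xxx,computer is good", "xxx,build algorithm", "yyy,computer algorithm"]
counts = count_second_column_words(sample)
```

Because the function only iterates over `lines`, passing a file object keeps a single line in memory at a time, just like the loop above.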

Answer 2

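It looks like the code in your question reads the 20 GB file, splits each line into space-separated tokens, and then builds a counter that keeps a count of every unique token. I'd say that is where your memory is going.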

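From the manual, csv.reader is an iterator:

a reader object which will iterate over lines in the given csvfile. csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called

So it is possible to iterate over a huge file with csv.reader.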

import collections
import csv

wordcount = collections.Counter()

# Python 3: open in text mode with newline='' (on Python 2, use mode 'rb')
with open('desc.csv', newline='') as infile:
    for row in csv.reader(infile):
        # count words in strings from second column
        wordcount.update(row[1].split())
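To see for yourself that csv.reader pulls lines lazily, here is a small sketch that feeds it an endless generator instead of a file. The generator `generated_lines` and the `produced` list are invented for this demonstration; if the reader tried to load all of its input up front, this would never finish.

```python
import csv
import itertools

produced = []  # records every line the reader actually asked for


def generated_lines():
    """Yield CSV lines forever, one at a time (hypothetical stand-in for a huge file)."""
    n = 0
    while True:
        n += 1
        produced.append(n)
        yield "id%d,word%d text\n" % (n, n)


reader = csv.reader(generated_lines())

# Take only the first three rows and keep their second column.
first_three = [row[1] for row in itertools.islice(reader, 3)]
```

After this runs, `produced` holds only as many entries as rows were consumed, which suggests the reader requests one line per row rather than reading ahead.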
