Question

I am using Python to find some patterns in a large csv file (1.2 million lines, 250 MB) and to modify each line where such a pattern is found. My approach is this:

dfile=open(csvfile,'r')
lines=dfile.readlines()
dfile.close()
for i in range(0, len(lines)):
    lines[i]=f(lines[i])
# f(.) is a function that modifies line string if a pattern is found
# then I have a code to write the processed data in another csv file.

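The problem is that after some iterations the code stops running and returns a memory error. My system has 32 GB of RAM. How can I improve memory performance? I tried reading the data line by line using the following approach: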

import linecache
j=1
while True:
    line=linecache.getline(csvfile,j)
    if line=='':
        break
    outp=open(newfile,'w')
    outp.write(f(line))
    outp.close()
    j+=1

This approach also failed:

encoding error reading location 0X9b?!

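Is there a solution?

In case you are interested in the function and the patterns in my csv file, here they are. This is a small sample of my csv file.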

Description           Effectivity                AvailableLengths  Vendors
Screw 2" length 3"    "machine1, machine2"       25mm              "vend1, ven2"
pin 3"                machine1                   2-3/4"            vend3
pin 25mm              "machine2, machine4"       34mm              "vend5,Vend6"
Filler 2" red         machine5                   "4-1/2", 3""      vend7
"descr1, descr2"      "machin1,machin2,machine3" 50                "vend1,vend4"

The fields in the csv file are comma-separated, so the first line looks like this:

Screw 2" length 3","machine1, machine2",25mm,"vend1, ven2"

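Because of the multi-valued fields and the use of quotes for dimensions, a csv reader cannot read this file. My function (f in the code above) replaces a comma with a semicolon if the comma sits between two values that belong to the same field, and replaces a quote with 'INCH' if the quote marks a dimension.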

f(firstline)=Screw 2INCH length 3INCH,machine1;machine2,25mm,vend1;ven2

Answer 1

For the encoding error, try opening the file with an explicit encoding:

open(csvfile, 'r', encoding = 'utf8')
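If the file is not actually UTF-8, the call above can still raise on a stray byte such as 0x9b. A minimal sketch of one way around that, assuming it is acceptable to substitute a placeholder character for undecodable bytes (read_lines and its parameters are names made up for illustration, and the real encoding of the file is unknown; cp1252 is only a guess worth trying):

```python
def read_lines(path, encoding="utf-8"):
    """Yield the lines of a text file one at a time.

    errors="replace" substitutes U+FFFD for any byte that cannot be
    decoded, so a single bad byte does not abort the whole run.
    """
    with open(path, "r", encoding=encoding, errors="replace") as fh:
        for line in fh:
            yield line
```

Calling read_lines(csvfile, encoding="cp1252") instead would decode 0x9b as a real character rather than a placeholder, if that turns out to be the file's encoding.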

As for performance, the function f() may have high complexity or memory consumption.
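Whatever f() costs, the readlines() approach in the question also keeps the whole file in memory as a list. A minimal sketch of streaming instead, assuming f is the question's per-line function (it is passed in as a parameter here; process_file and its parameter names are made up for illustration):

```python
def process_file(src_path, dst_path, f, encoding="utf-8"):
    """Stream src_path through f line by line and write the result to dst_path."""
    # Both files are opened once; the output is not reopened per line.
    with open(src_path, "r", encoding=encoding, errors="replace") as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        for line in src:        # the file object yields one line at a time
            dst.write(f(line))  # only the current line is held in memory
```

With this shape, something like process_file(csvfile, newfile, f) should need roughly constant memory regardless of file size, as long as f itself does not accumulate state.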

Could you paste the function f() here? If you are looking for patterns, you could also consider using regular expressions.
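Since the real f() was not posted, here is only a rough regex-based sketch of what it might look like, based on the sample rows in the question. It assumes that a quote is an inch mark when it directly follows a number that is not attached to a preceding letter (so 2" and 2-3/4" count, but the closing quote in machine2" does not), and that every remaining pair of quotes wraps one multi-valued field. Rows that break those assumptions, such as a quoted field that itself contains inch marks, would need extra handling.

```python
import re

# A number (optionally with - or / parts, e.g. 2-3/4) followed by a quote,
# where the number is not glued to a preceding letter or digit.
INCH_RE = re.compile(r'(?<![A-Za-z0-9])(\d+(?:[-/]\d+)*)"')
# Whatever quotes remain are assumed to delimit a multi-valued field.
QUOTED_FIELD_RE = re.compile(r'"([^"]*)"')
# A comma inside such a field, with any surrounding whitespace.
INNER_COMMA_RE = re.compile(r'\s*,\s*')

def f(line):
    """Replace inch marks with INCH and commas inside quoted fields with ';'."""
    line = INCH_RE.sub(r'\1INCH', line)
    return QUOTED_FIELD_RE.sub(
        lambda m: INNER_COMMA_RE.sub(';', m.group(1)), line)
```

On the first sample line from the question this produces Screw 2INCH length 3INCH,machine1;machine2,25mm,vend1;ven2, which matches the expected output given there.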
