I have two files. "A" is not too big (2 GB) while "B" is quite large at 60 GB. My original code looks like this:
import csv  # imports module csv

filea = "A.csv"
fileb = "B.csv"
output = "Python_modified.csv"

# open csv readers
source1 = csv.reader(open(filea, "r"), delimiter='\t')
source2 = csv.reader(open(fileb, "r"), delimiter='\t')

# prepare changes from file B
source2_dict = {}
for row in source2:
    source2_dict[row[2]] = row[2]

# write new changed rows
with open(output, "w") as fout:
    csvwriter = csv.writer(fout, delimiter='\t')
    for row in source1:
        # needs to check whether there are any changes prepared
        if row[3] in source2_dict:
            # change the item
            row[3] = source2_dict[row[3]]
        csvwriter.writerow(row)
Column 3 is read from both files and, if there is a match, column 4 in file A should be replaced with the content of column 4 from file B. However, because it has to read through the large file it is very slow. Is there any way to optimize this?
Answer 0 (score: 1)
You could try reading file_a into memory in large blocks and then processing each block. That way you do a batch of reads followed by a batch of writes, which should help reduce disk thrashing. You will need to decide what block_size to use, which obviously has to fit comfortably in memory.
from itertools import islice
import csv  # imports module csv

file_a = "A.csv"
file_b = "B.csv"
output = "Python_modified.csv"
block_size = 10000

# prepare changes from file B
source2_dict = {}
with open(file_b, 'rb') as f_source2:
    for row in csv.reader(f_source2, delimiter='\t'):
        source2_dict[row[3]] = row[4]   # just store the replacement value

# write new changed rows
with open(file_a, 'rb') as f_source1, open(output, "wb") as f_output:
    csv_source1 = csv.reader(f_source1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')

    # read input file_a in large groups
    for block in iter(lambda: list(islice(csv_source1, block_size)), []):
        for row in block:
            try:
                row[4] = source2_dict[row[3]]
            except KeyError:
                pass
            csv_output.writerow(row)
Secondly, to reduce memory usage: if you are only replacing a single value, store just that one value in the dictionary rather than the whole row, as sketched below.
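To make that concrete, here is a minimal sketch of building the lookup from file B, contrasting the two options. The column indices are the same ones used in the code above; adjust them to your actual layout:

import csv

file_b = "B.csv"
source2_dict = {}
with open(file_b, 'rb') as f_source2:
    for row in csv.reader(f_source2, delimiter='\t'):
        # Option 1: store the whole row -- keeps every column of B in memory:
        #     source2_dict[row[3]] = row
        # Option 2: store only the replacement column -- one string per key,
        # which is all the rewrite loop actually needs:
        source2_dict[row[3]] = row[4]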
Tested with Python 2.x. If you are using Python 3.x you will need to change how the files are opened, e.g.
with open(file_b, 'r', newline='') as f_source2:
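For reference, here is a sketch of the whole script adapted to Python 3.x; only the open() calls change (text mode with newline='' for both the readers and the writer), the logic is otherwise identical. This version is untested here, so treat it as a starting point:

from itertools import islice
import csv

file_a = "A.csv"
file_b = "B.csv"
output = "Python_modified.csv"
block_size = 10000

# prepare changes from file B
source2_dict = {}
with open(file_b, 'r', newline='') as f_source2:
    for row in csv.reader(f_source2, delimiter='\t'):
        source2_dict[row[3]] = row[4]   # just store the replacement value

# write new changed rows
with open(file_a, 'r', newline='') as f_source1, \
        open(output, 'w', newline='') as f_output:
    csv_source1 = csv.reader(f_source1, delimiter='\t')
    csv_output = csv.writer(f_output, delimiter='\t')

    # read input file_a in large groups
    for block in iter(lambda: list(islice(csv_source1, block_size)), []):
        for row in block:
            try:
                row[4] = source2_dict[row[3]]
            except KeyError:
                pass
            csv_output.writerow(row)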