Question

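I have a large A.csv file (~5 GB) with several columns. One of the columns is Model. There is another large B.csv file (~15 GB) with the columns Vendor, Name, and Model.

Two questions:

1) How can I create a result file that combines all columns from A.csv with the corresponding Vendor and Name from B.csv (joined on Model)? The catch: how do I do this with only 4 GB of RAM, using Python?

2) How can I create a sample (say, 1 GB) result file that combines a random subsample of A.csv (all columns) joined with Vendor and Name from B.csv? The catch, again, is the 4 GB of RAM.

I know how to do this in pandas, but 4 GB is a limiting factor I can't get around :(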

Answer 1

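Here's an idea:

Step 1: Sort the two files by Model. Mergesort is good for this. Split each file into pieces small enough to sort in RAM, sort each piece, and then merge the pieces into one large sorted file. For a good way to merge multiple sorted files, see my answer to an earlier question. Update: see the end of this answer for an example with code.

Step 2: Join the two files by Model. This is again much like the merge step of mergesort: walk through the two sorted files "in parallel", advancing each one as appropriate, and join the rows whose Model values match.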
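The splitting half of Step 1 could be sketched roughly as follows. This is only an illustration under assumptions: the function name `split_into_sorted_chunks`, the `rows_per_chunk` parameter, and the `.chunkN.csv` naming scheme are made up here, and Model values are compared as plain strings.

```python
import csv
import itertools

def split_into_sorted_chunks(path, key_column, rows_per_chunk):
    """Write sorted pieces of the CSV at `path` and return their file names."""
    chunk_names = []
    with open(path, newline='') as infile:
        reader = csv.reader(infile)
        header = next(reader)
        key_index = header.index(key_column)
        for number in itertools.count(1):
            # Read at most rows_per_chunk rows; only this piece is held in RAM.
            rows = list(itertools.islice(reader, rows_per_chunk))
            if not rows:
                break
            rows.sort(key=lambda row: row[key_index])
            # Hypothetical naming scheme: A.csv.chunk1.csv, A.csv.chunk2.csv, ...
            chunk_name = '{}.chunk{}.csv'.format(path, number)
            with open(chunk_name, 'w', newline='') as outfile:
                writer = csv.writer(outfile)
                writer.writerow(header)
                writer.writerows(rows)
            chunk_names.append(chunk_name)
    return chunk_names
```

A call such as `split_into_sorted_chunks('A.csv', 'Model', 1000000)` would leave a set of sorted pieces that can then be merged into one sorted file. The chunk size has to be tuned to the available RAM.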

Pseudocode for Step 2:

open the two sorted files A and B
blockA = read block of same-model rows from A
blockB = read block of same-model rows from B
while True:
    while model of blockA differs from model of blockB:
        if model of blockA is smaller:
            blockA = read block of same-model rows from A
            quit if there isn't any (i.e. end of file reached)
        else:
            blockB = read block of same-model rows from B
            quit if there isn't any (i.e. end of file reached)
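    output the cross product of blockA and blockB
    blockA = read block of same-model rows from A
    blockB = read block of same-model rows from B
    quit if either one doesn't exist (i.e. end of file reached)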
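In Python, the pseudocode above might translate to something like the sketch below. It assumes both files are already sorted by Model as plain strings, that each block of same-model rows fits in RAM, and that A.csv has no Vendor or Name column of its own. The names `model_blocks` and `merge_join` are illustrative, not from any library.

```python
import csv
from itertools import groupby
from operator import itemgetter

def model_blocks(reader, model_index):
    """Yield (model, rows) for each run of consecutive same-model rows."""
    for model, rows in groupby(reader, key=itemgetter(model_index)):
        yield model, list(rows)

def merge_join(path_a, path_b, path_out):
    with open(path_a, newline='') as file_a, \
         open(path_b, newline='') as file_b, \
         open(path_out, 'w', newline='') as outfile:
        reader_a, reader_b = csv.reader(file_a), csv.reader(file_b)
        header_a, header_b = next(reader_a), next(reader_b)
        vendor_b = header_b.index('Vendor')
        name_b = header_b.index('Name')
        writer = csv.writer(outfile)
        writer.writerow(header_a + ['Vendor', 'Name'])
        blocks_a = model_blocks(reader_a, header_a.index('Model'))
        blocks_b = model_blocks(reader_b, header_b.index('Model'))
        block_a = next(blocks_a, None)
        block_b = next(blocks_b, None)
        # Stop as soon as either file runs out of blocks.
        while block_a is not None and block_b is not None:
            if block_a[0] < block_b[0]:
                block_a = next(blocks_a, None)
            elif block_a[0] > block_b[0]:
                block_b = next(blocks_b, None)
            else:
                # Same model on both sides: write the cross product.
                for row_a in block_a[1]:
                    for row_b in block_b[1]:
                        writer.writerow(row_a + [row_b[vendor_b], row_b[name_b]])
                block_a = next(blocks_a, None)
                block_b = next(blocks_b, None)
```

This behaves like an inner join: models that appear in only one of the files produce no output rows.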

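Another idea:

If there are relatively few models, it may be better to separate the rows into files by Model. For example, store the rows in files A_Model1.csv, A_Model2.csv, etc. and B_Model1.csv, B_Model2.csv, etc. Then take the cross product of A_Model1.csv and B_Model1.csv, of A_Model2.csv and B_Model2.csv, and so on.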
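A rough sketch of that partitioning, assuming there are few enough models to keep one output file open per model and that Model values are safe to use in file names. The function name `partition_by_model` and the `prefix` parameter are hypothetical.

```python
import csv
from contextlib import ExitStack

def partition_by_model(path, prefix):
    """Write each row of `path` to '<prefix>_<model>.csv'; return the models seen."""
    writers = {}
    # ExitStack closes the input file and every per-model output file at the end.
    with ExitStack() as stack:
        infile = stack.enter_context(open(path, newline=''))
        reader = csv.reader(infile)
        header = next(reader)
        model_index = header.index('Model')
        for row in reader:
            model = row[model_index]
            if model not in writers:
                name = '{}_{}.csv'.format(prefix, model)
                outfile = stack.enter_context(open(name, 'w', newline=''))
                writers[model] = csv.writer(outfile)
                writers[model].writerow(header)
            writers[model].writerow(row)
    return sorted(writers)
```

Running it once with prefix `A` on A.csv and once with prefix `B` on B.csv gives the per-model file pairs described above.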

For question 2, I'd just count the rows, use random.sample to pick row numbers, and then fetch those rows.

>>> import random
>>> number_of_rows = 100
>>> number_of_sample_rows = 10
>>> sorted(random.sample(range(number_of_rows), number_of_sample_rows))
[6, 18, 23, 32, 41, 44, 58, 59, 91, 96]

(then go through the file and fetch those rows)

Update: Here's code and a demo for the merge part of Step 1 above. I made three files, B1.csv, B2.csv, and B3.csv:

Vendor,Name,Model
vfoo,nhi,m1
vbar,nho,m4
vbaz,nhe,m7

Vendor,Name,Model
vZ,nX,m2
vY,nZ,m6
vX,nY,m8

Vendor,Name,Model
v,n3,m3
v,na,m5
v,n_,m9

Here's the merged result file, Bmerged.csv:

Vendor,Name,Model
vfoo,nhi,m1
vZ,nX,m2
v,n3,m3
vbar,nho,m4
v,na,m5
vY,nZ,m6
vbaz,nhe,m7
vX,nY,m8
v,n_,m9

And here's the code:

import csv, heapq

filenames = ('B1.csv', 'B2.csv', 'B3.csv')

# Prepare the input streams
files = list(map(open, filenames))
readers = [iter(csv.reader(file)) for file in files]
headers = list(map(next, readers))
def model_and_row(row):
    return row[2], row
model_and_row_streams = [map(model_and_row, reader) for reader in readers]

# Merge them into the output file
with open('Bmerged.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(headers[0])
    for _, row in heapq.merge(*model_and_row_streams):
        writer.writerow(row)

# Close the input files
for file in files:
    file.close()

Note that I'm using Python 3. In Python 2, you'd need to use itertools.imap instead of map in order not to read the entire files into memory right away.
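For question 2, the "go through the file and fetch those rows" part could look roughly like the sketch below. It makes two streaming passes over the file, so only the set of chosen row numbers is held in RAM. The function name `sample_rows` and its parameters (including the optional `seed`) are made up for this illustration.

```python
import csv
import random

def sample_rows(path_in, path_out, number_of_sample_rows, seed=None):
    """Copy the header plus a random sample of data rows to `path_out`."""
    # First pass: count the data rows (the header is not counted).
    with open(path_in, newline='') as infile:
        reader = csv.reader(infile)
        next(reader)
        number_of_rows = sum(1 for _ in reader)
    # random.sample raises ValueError if the sample is larger than the file.
    rng = random.Random(seed)
    wanted = set(rng.sample(range(number_of_rows), number_of_sample_rows))
    # Second pass: copy the header and the chosen rows, in file order.
    with open(path_in, newline='') as infile, \
         open(path_out, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        writer.writerow(next(reader))
        for row_number, row in enumerate(reader):
            if row_number in wanted:
                writer.writerow(row)
```

The sampled copy of A.csv can then go through the same sort and join steps as the full file.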

Answer 2

As @Marc B said, reading one line at a time is the solution. As for the join, I would do the following (pseudocode: I don't know Python).

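1. "Select distinct Model from A" on the first file, A.csv

Read all rows, look at the Model field, and collect the distinct values in a list/array/map.

2. "Select distinct Model from B" on the second file, B.csv

Same operation as 1, but using another list/array/map.

3. Find matching models

Compare the two lists/arrays/maps and keep only the matching models (these will be part of the join).

4. Join

Read the rows of file A that match a model, read all rows of file B that match the same model, and write the join result to a file C. Repeat for all models.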

Note: this is not particularly optimized.
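Since the steps above are given only as pseudocode, here is one way they might look in Python. This is a sketch under assumptions: the set of distinct Model values fits in RAM, A.csv has no Vendor or Name column of its own, and the function names are invented for the example.

```python
import csv

def distinct_models(path):
    """Steps 1 and 2: collect the distinct Model values of one file."""
    with open(path, newline='') as infile:
        return {row['Model'] for row in csv.DictReader(infile)}

def matching_models(path_a, path_b):
    """Step 3: keep only the models that occur in both files."""
    return distinct_models(path_a) & distinct_models(path_b)

def rows_with_model(path, model):
    """Stream the rows of `path` whose Model equals `model`."""
    with open(path, newline='') as infile:
        for row in csv.DictReader(infile):
            if row['Model'] == model:
                yield row

def join_by_model(path_a, path_b, path_out):
    """Step 4: write the join result, one model at a time."""
    with open(path_a, newline='') as infile:
        fieldnames = csv.DictReader(infile).fieldnames + ['Vendor', 'Name']
    with open(path_out, 'w', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for model in sorted(matching_models(path_a, path_b)):
            # One full scan of each file per model: simple, but slow for many models.
            rows_b = [(row['Vendor'], row['Name'])
                      for row in rows_with_model(path_b, model)]
            for row_a in rows_with_model(path_a, model):
                for vendor, name in rows_b:
                    writer.writerow(dict(row_a, Vendor=vendor, Name=name))
```

Because every matching model triggers a full scan of both files, the running time grows with the number of models, which is why this approach is best suited to a small number of distinct models.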

For point 2, just pick a subset of the matching models and/or read only part of the rows of file A and/or B that have matching models.
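Picking a subset of the matching models and keeping only their rows could be sketched as below. The names `pick_models` and `write_rows_for_models` are hypothetical, and the set of matching models is assumed to be computed already. Note that sampling whole models is not the same as a uniform random sample of rows.

```python
import csv
import random

def pick_models(matched_models, how_many, seed=None):
    """Choose `how_many` models at random from the matching ones."""
    # sorted() turns the set into a sequence, which random.sample requires
    # (sets are no longer accepted as of Python 3.11).
    rng = random.Random(seed)
    return set(rng.sample(sorted(matched_models), how_many))

def write_rows_for_models(path_in, path_out, models):
    """Copy the header and every row whose Model is in `models`."""
    with open(path_in, newline='') as infile, \
         open(path_out, 'w', newline='') as outfile:
        reader = csv.reader(infile)
        writer = csv.writer(outfile)
        header = next(reader)
        model_index = header.index('Model')
        writer.writerow(header)
        for row in reader:
            if row[model_index] in models:
                writer.writerow(row)
```

Applying `write_rows_for_models` to both A.csv and B.csv with the same chosen models yields two smaller files that can be joined like the full ones.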

Answer 3

Read the file line by line in Python. Here's a quick and simple way to do it, for example:

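# The with statement closes both files, so buffered output is not lost
with open("file.csv", "r") as source, open("outputfile.csv", "a") as output:
    lines = []
    for line in source:
        lines.append(line)
        # Flush the buffer once it holds a million lines
        if len(lines) == 1000000:
            output.writelines(lines)
            del lines[:]
    # Write whatever is left over
    if lines:
        output.writelines(lines)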

Adjust the length of the array in the if statement according to the available RAM.

Using limited RAM to join large files in an SQL-like way
