Question

I have a huge text file (big enough to fill my computer's memory) that I need to split into smaller files.

The file contains CSV lines, where the first field of each line is an ID:

ID1, val11, val12, val13
ID2, val21, val22, val23
ID1, val31, val32, val33
ID3, val41, val42, val43

I want to read the source file one line (or one group of lines) at a time and create smaller files, grouped by the ID of each line:

File1:
    val11, val12, val13
    val31, val32, val33
File2:
    val21, val22, val23
File3:
    val41, val42, val43

So far I can do this with the following code, but it takes far too long (I don't have 10 days to wait for it).

def groupIDs(fileName,folder):
    loadedFile = open(fileName, 'r')
    firstLine = loadedFile.readline() #skip titles

    folder += "/"

    count = 0;

    for line in loadedFile:

        elems = line.split(',')
        id = elems[0]

        rest = ""
        for elem in elems[1:]:
            rest+=elem + ","

        with open(folder+id,'a') as f:
            f.write(rest[:-1])

        #printing progress
        count+=1
        if count % 50000 == 0:
            print(count)

    loadedFile.close()

The bottleneck seems to be hard-disk performance, as shown by the resource monitor (CPU usage is below 20% and memory is barely touched).

How can I improve the performance?

Answer 1

You can keep the data in RAM and flush it to disk every few thousand lines, or whenever the RAM has filled up to a threshold of your choosing.

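You should also use context managers with your files, and use the os.path or pathlib modules from the standard library instead of building paths by hand from strings.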
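As a small illustration of that advice, here is a minimal sketch using pathlib and a with block. The helper name append_to_category and the sample values are made up for this example; they are not part of the solution that follows.

```python
import tempfile
from pathlib import Path


def append_to_category(outputdir, category, data):
    # The / operator joins path components with the platform's separator.
    path = Path(outputdir) / (category + '.csv')
    # The with block closes the file even if write() raises.
    with path.open('ab') as f:
        f.write(data)
    return path


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as outdir:
    append_to_category(outdir, 'ID1', b'val11, val12, val13\n')
    written = append_to_category(outdir, 'ID1', b'val31, val32, val33\n')
    content = written.read_bytes()
    filename = written.name
```

Opening in 'ab' mode appends, so repeated calls for the same ID accumulate in one file.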

Here is a solution that saves every 10000 lines; adjust it to your problem:

import os
from glob import iglob
from collections import defaultdict


def split_files_into_categories(inputfiles, outputdir):
    count = 0
    categories = defaultdict(bytearray)

    for inputfile in inputfiles:
        with open(inputfile, 'rb') as f:
            next(f) # skip first line

            for line in f:

                if count % 10000 == 0:
                    save_results(categories, outputdir)
                    categories.clear()

                category, _, rest = line.partition(b',')

                categories[category] += rest
                count += 1

    save_results(categories, outputdir)


def save_results(categories, outputdir):
    for category, data in categories.items():
        with open(os.path.join(outputdir, category.decode() + '.csv'), 'ab') as f:
            f.write(data)


if __name__ == '__main__':
    # run on all csvs in the data folder
    split_files_into_categories(iglob('data/*.csv'), 'by_category')

Some explanation:

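I open the files in binary mode and use a bytearray, which avoids copying data. In Python, strings are immutable, so += creates a new string and reassigns it.
defaultdict(bytearray) creates an empty bytearray for each new category on first access.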
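To make the difference concrete, here is a short sketch (the variable names and sample values are only for illustration) comparing += on an immutable bytes object with += on a bytearray, followed by the defaultdict behaviour:

```python
from collections import defaultdict

# bytes is immutable: += builds a new object and rebinds the name.
immutable = b'val11,'
before = immutable
immutable += b'val12\n'
rebound = immutable is not before      # `before` still holds b'val11,'

# bytearray is mutable: += extends the existing buffer in place.
buffer = bytearray(b'val11,')
same_buffer = buffer
buffer += b'val12\n'
in_place = buffer is same_buffer

# defaultdict(bytearray) creates an empty buffer the first time a key is used.
categories = defaultdict(bytearray)
categories[b'ID1'] += b' val11, val12\n'
categories[b'ID1'] += b' val31, val32\n'
```

The same reasoning applies to str, which is immutable like bytes.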

You can replace if count % 10000 == 0 with a check on memory consumption, like this:

import os
import psutil

process = psutil.Process(os.getpid())

Then check:

# save results if process uses more than 1GB of ram
if process.memory_info().rss > 1e9:
    save_results(categories, outputdir)
    categories.clear()
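If you would rather not depend on psutil, a rough alternative (not what the code above does) is to count the buffered bytes yourself and flush once a budget is exceeded. The process RSS may not drop right after the buffers are cleared, so a direct count can be a more predictable trigger. This is only a sketch: split_with_byte_budget, flush and max_buffered_bytes are names invented for this example.

```python
import os
import tempfile
from collections import defaultdict


def flush(categories, outputdir):
    # Append each category's buffered bytes to its own file, then empty the buffers.
    for category, data in categories.items():
        path = os.path.join(outputdir, category.decode() + '.csv')
        with open(path, 'ab') as f:
            f.write(data)
    categories.clear()


def split_with_byte_budget(lines, outputdir, max_buffered_bytes=256 * 1024 * 1024):
    # max_buffered_bytes is a hypothetical knob: flush once this much data is buffered.
    categories = defaultdict(bytearray)
    buffered = 0
    flushes = 0
    for line in lines:
        category, _, rest = line.partition(b',')
        categories[category] += rest
        buffered += len(rest)
        if buffered >= max_buffered_bytes:
            flush(categories, outputdir)
            buffered = 0
            flushes += 1
    flush(categories, outputdir)  # write whatever is left
    return flushes


# Tiny demonstration with an artificially small budget of 10 bytes.
sample = [b'ID1,a,b\n', b'ID2,c,d\n', b'ID1,e,f\n', b'ID3,g,h\n']
with tempfile.TemporaryDirectory() as outdir:
    mid_run_flushes = split_with_byte_budget(sample, outdir, max_buffered_bytes=10)
    result = {}
    for name in sorted(os.listdir(outdir)):
        with open(os.path.join(outdir, name), 'rb') as f:
            result[name] = f.read()
```

For a real run you would pass an open binary file as lines (after skipping the header) and pick a budget that fits comfortably in your RAM.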

Group lines of a text file into one file per category - most efficient way
