How do you parallelize a program to read and write a large file in Python?

Asked: 2019-11-25 19:34:35

Tags: python multiprocessing

I'm trying to read and write data from a large file (roughly 300 million lines, around 200 GB) using Python. I have the basic code working, but I'd like to parallelize it so it runs faster. To do this I've been following this guide: https://www.blopig.com/blog/2016/08/processing-large-files-using-python/. However, when I try to parallelize the code I get the error: "TypeError: worker() argument after * must be an iterable, not int". How can I get the code running, and do you have any suggestions for making it more efficient? Note that I'm fairly new to Python.

Basic code (where id_pct1 and id_pct001 have already been set):

with open(file1) as f, open('file1', 'w') as out_f1, open('file2', 'w') as out_f001:
    for line in f:
        data = line.split('*')
        if data[30] in id_pct1: out_f1.write(line)
        if data[30] in id_pct001: out_f001.write(line)

Parallel code:

import multiprocessing as mp

def worker(lineByte):
    with open(file1) as f, open('file1', 'w') as out_f1, open('file2', 'w') as out_f001:
        f.seek(lineByte)
        line = f.readline()
        data = line.split('*')
        if data[30] in id_pct1: out_f1.write(line)
        if data[30] in id_pct001: out_f001.write(line)


def main():
    pool = mp.Pool()
    jobs = []

    with open('Subsets/FirstLines.txt') as f:
        nextLineByte = 0
        for line in f:
            jobs.append(pool.apply_async(worker, (nextLineByte)))
            nextLineByte += len(line)

        for job in jobs:
            job.get()

        pool.close()

if __name__ == '__main__':
    main()

1 Answer:

Answer 0 (score: 0)

Try

jobs.append(pool.apply_async(worker, (nextLineByte,)))

pool.apply_async() expects its args argument to be an iterable (typically a tuple).

(nextLineByte) is still just an int, because parentheses alone don't create a tuple; the trailing comma in (nextLineByte,) does, which is why the error is raised.
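
As a quick illustration of the one-element-tuple gotcha (a minimal sketch, not part of the original answer):

nextLineByte = 1024

# Parentheses alone don't create a tuple; this expression is still an int.
print(type((nextLineByte)))     # <class 'int'>

# The trailing comma is what makes a one-element tuple.
print(type((nextLineByte,)))    # <class 'tuple'>

# Pool.apply_async(func, args) calls func(*args), so args must be an
# iterable such as a tuple; passing a bare int raises the TypeError above.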