Question

I am trying to read 3 different files in Python and do some processing to extract data from them. Then I want to merge the data into one big file.

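Since each individual file is already large and processing it can take a while, I am wondering whether:

I can read all three files at once (in multiple threads/processes),
wait for the processing of all the files to finish,
and, once all the outputs are ready, pipe all the data to a downstream function that merges it.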

Can someone suggest improvements to this code so that it does what I want?

import pandas as pd
from functools import reduce

file01_output = ''
file02_output = ''
file03_output = ''

# I want to do all these three "with open(..)" at once.
# do_something is a placeholder for the per-line processing, not a real function.
with open('file01.txt', 'r') as file01:
    for line in file01:
        something01 = do_something(line)
        file01_output += something01

with open('file02.txt', 'r') as file02:
    for line in file02:
        something02 = do_something(line)
        file02_output += something02

with open('file03.txt', 'r') as file03:
    for line in file03:
        something03 = do_something(line)
        file03_output += something03

def merge(a, b, c):
    # a, b, c are the outputs of the three files, passed in by the caller

    # compile the list of dataframes you want to merge
    data_frames = [a, b, c]

    df_merged = reduce(lambda left, right: pd.merge(left, right,
                       on=['common_column'], how='outer'), data_frames).fillna('.')
    return df_merged

Answer 1

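There are many ways to use multiprocessing for your problem, so I will propose just one. Since, as you mentioned, the processing done on the data in each file is CPU bound, you can run it in a separate process and expect to see some improvement (how much, if any, depends on the problem, the algorithm, the number of cores, and so on). For example, the overall structure could be a single pool to which you map the list of all the filenames you need to process, with the computation done inside the mapped function.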

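It is easier with a concrete example. Let's pretend we have a list of CSV files, 'file01.csv', 'file02.csv', 'file03.csv', each with a column NUMBER, and we want to compute whether each number is prime (CPU bound). For example, file01.csv: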

NUMBER
1
2
3
...

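The other files look similar but contain different numbers, to avoid duplicating work. The code to compute the primes could look like this: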

import pandas as pd
from multiprocessing import Pool
from sympy import isprime

def compute(filename):
    # IO (probably not faster)
    my_data_df = pd.read_csv(filename)

    # do some computing (CPU)
    my_data_df['IS_PRIME'] = my_data_df.NUMBER.map(isprime)

    return my_data_df

if __name__ == '__main__':
    filenames = ['file01.csv', 'file02.csv', 'file03.csv']

    # construct the pool and map to the workers
    with Pool(2) as pool:
        results = pool.map(compute, filenames)
    print(pd.concat(results))

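I have used the sympy package for its convenient isprime function, and I am sure the structure of your data is quite different from mine, but hopefully this example illustrates a structure you could use too. The plan of performing your CPU-bound computations in a pool (or a list of Process objects) and then merging/reducing/concatenating the results is a reasonable approach to the problem.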
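As a rough sketch of how that plan might map back onto the three text files in the question: the version below uses concurrent.futures.ProcessPoolExecutor from the standard library in place of multiprocessing.Pool, and finishes with the outer merge on 'common_column' from the question. The line format (one tab-separated key and value per line) is purely an assumption for illustration, and process_file, merge_frames and run_all are hypothetical names, not anything from the original post.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import reduce

import pandas as pd


def process_file(path):
    """Read one file line by line and return its data as a DataFrame.

    ASSUMPTION: every line looks like 'key<TAB>value'. The real per-line
    work from the question ("do something in line") would replace the split.
    """
    rows = []
    with open(path, "r") as handle:
        for line in handle:
            key, value = line.rstrip("\n").split("\t")
            rows.append((key, value))
    # Name the value column after the file so merged columns stay distinct.
    return pd.DataFrame(rows, columns=["common_column", path])


def merge_frames(frames):
    """Outer-join all frames on 'common_column' and fill the gaps with '.'."""
    merged = reduce(
        lambda left, right: pd.merge(left, right, on=["common_column"], how="outer"),
        frames,
    )
    return merged.fillna(".")


def run_all(paths, max_workers=None):
    """Process each file in a worker process, then merge the results."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # list(...) waits for every worker; results come back in input order.
        frames = list(executor.map(process_file, paths))
    return merge_frames(frames)
```

It would be called as merged = run_all(['file01.txt', 'file02.txt', 'file03.txt']), placed under an if __name__ == '__main__': guard as in the example above, so that worker processes can import the module safely. Returning one DataFrame per file keeps the merge step independent of how the files were read, so it can be checked without starting any processes.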

How to read multiple files in multiple threads/processes to optimize data analysis?
