Question

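I am processing a large CSV data file with the fields user_id, timestamp, and category, and I am building a score per category for each user. I first read the CSV in chunks and apply a groupby on each chunk (keyed on the last two digits of user_id), so that I end up with 100 groups of users, which I store in an HDF5 store.

Then I run a big for loop over the store to process each stored group, one after another. For each of them, I group by user_id and compute the user's scores. I then write an output CSV with one line per user, containing all of that user's scores.

I noticed that this main loop takes 4 hours on my personal computer, and I would like to speed it up since it looks perfectly parallelizable. How can I do that? I thought of multiprocessing or hadoop streaming; which would be best?

Here is my (simplified) code: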

import csv

import pandas as pd

def sub_group_hash(x):
    # Key each row by the last two characters of user_id (assumes string IDs)
    return x['user_id'].str[-2:]

reader = pd.read_csv('input.csv', chunksize=500000)
with pd.HDFStore('grouped_input.h5') as store:
    for chunk in reader:
        groups = chunk.groupby(sub_group_hash(chunk))
        for grp, grouped in groups:
            store.append('group_%s' % grp, grouped,
                 data_columns=['user_id','timestamp','category'])

with open('stats.csv', 'w', newline='') as outfile:
    spamwriter = csv.writer(outfile)
    with pd.HDFStore('grouped_input.h5') as store:
        for grp in store.keys(): #this is the loop I would like to parallelize
            grouped = store.select(grp).groupby('user_id')
            for user, user_group in grouped:
                # my_function is my own scoring function (not shown); it returns a list of scores
                output = my_function(user,user_group)
                spamwriter.writerow([user] + output)

Answer 1

I would recommend multithreading. The threading library is very simple and intuitive. https://docs.python.org/3/library/threading.html#thread-objects

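I am a little confused about what you mean by your main loop, but I am assuming it is the whole process above. If that is the case, wrap it in a function definition and use the simple threading call: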

import threading

# process is a placeholder for the function that wraps your main loop
t1 = threading.Thread(target=process, args=("any", "inputs"))
t1.start()

A decent tutorial can be found here. If you are comfortable with Python, it also shows more advanced threading techniques. http://www.tutorialspoint.com/python/python_multithreading.htm

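The tricky part is writing to the file: you do not want all the threads writing to it at once. Fortunately, you can create a choke point with a lock. Wrapping the write in the lock's acquire() and release() calls ensures that only one thread writes at a time.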
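To make the locking idea concrete, here is a minimal sketch, not your actual pipeline. The names score_group and write_lock and the toy groups dict are made up for illustration, and the "score" is just a sum standing in for your real scoring function. It writes to an in-memory buffer instead of stats.csv so it runs on its own.

```python
import csv
import io
import threading

# One lock shared by every thread that writes to the output.
write_lock = threading.Lock()

def score_group(rows, writer):
    # Hypothetical stand-in for the per-group work: one score per user.
    for user, values in rows.items():
        result = [user, sum(values)]
        # "with" calls acquire() on entry and release() on exit,
        # so only one thread writes a row at a time.
        with write_lock:
            writer.writerow(result)

# Toy data standing in for the groups held in the HDF5 store (assumption).
groups = {
    "group_00": {"u100": [1, 2], "u200": [3]},
    "group_01": {"u101": [4], "u201": [5, 6]},
}

buffer = io.StringIO()  # stands in for the open stats.csv file
writer = csv.writer(buffer)

threads = [
    threading.Thread(target=score_group, args=(rows, writer))
    for rows in groups.values()
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread before reading the output

# Row order depends on thread scheduling, so sort before inspecting.
written = sorted(buffer.getvalue().splitlines())
```

Keeping the lock around the write only, and not around the scoring, lets the threads overlap on everything except the file access.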

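Also be mindful of how many cores your computer has. If you run more threads than you have cores, each thread has to wait for CPU time and you gain little in speed. You can also very easily fork bomb your computer if you create an unbounded number of processes.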
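One way to keep the number of workers bounded is to use a pool sized from the core count instead of starting one thread per group. This is only a sketch: process_group is a hypothetical placeholder for the real per-group work, and the group names are generated here rather than read from the store. It uses concurrent.futures from the standard library, which is a swap-in for the raw threading calls above.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_group(group_name):
    # Hypothetical placeholder for loading one group and scoring its users.
    return group_name, len(group_name)

# 100 group names like the ones in the HDF5 store (generated for the example).
group_names = ["group_%02d" % i for i in range(100)]

# Never start more workers than there are cores (or groups).
# os.cpu_count() can return None, so fall back to 1.
max_workers = min(len(group_names), os.cpu_count() or 1)

with ThreadPoolExecutor(max_workers=max_workers) as pool:
    # map() queues all 100 groups but runs at most max_workers at once,
    # and returns results in the same order as group_names.
    results = list(pool.map(process_group, group_names))
```

If the scoring is pure Python and CPU-bound, threads in CPython will still take turns on the interpreter lock. ProcessPoolExecutor offers the same interface, so it can be tried in place of ThreadPoolExecutor, provided the work function and its arguments can be pickled.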

Speeding up a python-pandas script on big data with multiprocessing or hadoop
