Question

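I am populating a matrix using conditional lookups from a file. The file is very large (2,500,000 records) and is saved as a dataframe ('file'). Each matrix row operation (lookup) is independent of the others. Is there any way I can parallelize this process?

I am working in pandas and Python. My current approach is naive.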

for r in row:
    for c in column:
        num=file[(file['Unique_Inventor_Number']==r) & (file['AppYearStr']==c)]['Citation'].tolist()
        num = len(list(set(num)))
        d.at[r, c] = num  # DataFrame.set_value was removed in pandas 1.0

Answer 1

For 2.5 million records you should be able to do

res = file.groupby(['Unique_Inventor_Number', 'AppYearStr']).Citation.nunique()

The matrix should then be available as

res.unstack(level=1).fillna(0).values

I am not sure whether it is the fastest option, but it should be significantly faster than your implementation.
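To make the shape of the result concrete, here is a minimal sketch on a made-up six-row frame. The column names come from the question; the values and the name `matrix` are purely illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the real 'file' dataframe (values are invented).
file = pd.DataFrame({
    "Unique_Inventor_Number": [1, 1, 1, 2, 2, 3],
    "AppYearStr": ["2001", "2001", "2002", "2001", "2001", "2002"],
    "Citation": ["a", "a", "b", "c", "d", "e"],
})

# Number of distinct citations per (inventor, year) pair.
res = file.groupby(["Unique_Inventor_Number", "AppYearStr"]).Citation.nunique()

# Pivot the years into columns; pairs that never occur become 0.
matrix = res.unstack(level=1).fillna(0).astype(int)
print(matrix)
```

Rows are inventors and columns are years, so `matrix.loc[1, "2001"]` is 1 here because inventor 1 cites "a" twice in 2001 and duplicates are counted once.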

Answer 2

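[Edit] As Roland mentioned in the comments, in the standard Python implementation this post does not offer any solution for improving CPU performance.

In the standard Python implementation, threads do not really improve performance on CPU-bound tasks. There is a "Global Interpreter Lock" that enforces that only one thread at a time can execute Python bytecode. This was done to keep the complexity of memory management down.

Have you tried using different threads for different functions?

Let's say you separate your dataframe into columns and create multiple threads. Then you assign each thread to apply a function to a column. If you have enough processing power, you might be able to gain a lot of time: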
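Since the lookups in the question are independent per inventor, one way to prepare the work for parallel execution is to split the inventors into chunks and count each chunk separately. The sketch below is only an illustration under assumptions: `count_chunk`, `chunked_counts` and the toy data are hypothetical names and values, and it defaults to a thread pool so that it runs anywhere. For CPU-bound work the usual route around the GIL is to pass `ProcessPoolExecutor` as `executor_cls` instead, called from under an `if __name__ == '__main__':` guard; whether that pays off for a given dataset has to be measured.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def count_chunk(chunk):
    """Distinct citations per (inventor, year) for one slice of the data."""
    return chunk.groupby(["Unique_Inventor_Number", "AppYearStr"])["Citation"].nunique()


def chunked_counts(df, n_chunks=4, executor_cls=ThreadPoolExecutor):
    # Split by inventor, not by row, so every (inventor, year) pair lands
    # in exactly one chunk and the partial results never overlap.
    inventors = df["Unique_Inventor_Number"].unique()
    groups = np.array_split(inventors, n_chunks)
    chunks = [df[df["Unique_Inventor_Number"].isin(g)] for g in groups if len(g)]
    with executor_cls(max_workers=n_chunks) as pool:
        parts = list(pool.map(count_chunk, chunks))
    return pd.concat(parts).sort_index()


# Invented toy data, only to show that the chunked result matches
# a single groupby over the whole frame.
toy = pd.DataFrame({
    "Unique_Inventor_Number": [1, 1, 1, 2, 2, 3],
    "AppYearStr": ["2001", "2001", "2002", "2001", "2001", "2002"],
    "Citation": ["a", "a", "b", "c", "d", "e"],
})
result = chunked_counts(toy, n_chunks=2)
print(result.equals(count_chunk(toy)))
```

Splitting on the inventor column is the design choice that keeps the partial results disjoint, so they can simply be concatenated without a second aggregation step.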

from threading import Thread
import pandas as pd
import numpy as np
from queue import Queue
from time import time

# Those will be used afterwards
N_THREAD = 8
q = Queue()
df2 = pd.DataFrame()  # The output of the script 

# You create the job that each thread will do
def apply(series, func):
    df2[series.name] = series.map(func)


# You define the context of the jobs
def threader():
    while True:
        worker = q.get()
        apply(*worker)
        q.task_done()

def main():

    # You import your data to a pandas dataframe
    df = pd.DataFrame(np.random.randn(100000,4), columns=['A', 'B', 'C', 'D'])

    # You create the functions you will apply to your columns
    func1 = lambda x: x<10
    func2 = lambda x: x==0
    func3 = lambda x: x>=0
    func4 = lambda x: x<0
    func_rep = [func1, func2, func3, func4]

    for x in range(N_THREAD):  # You create your threads
        # daemon=True lets the program exit although threader() loops forever
        t = Thread(target=threader, daemon=True)
        t.start()

    # Now is the tricky part: You enclose the arguments that
    # will be passed to the function into a tuple which you
    # put into a queue. Then you start the job by "joining"
    # the queue
    for i, func in enumerate(func_rep):
        worker = (df.iloc[:, i], func)
        q.put(worker)

    t0 = time()
    q.join()
    print("Entire job took: {:.3} s.".format(time() - t0))

if __name__ == '__main__':
    main()
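For comparison, the same one-task-per-column idea can be written more compactly with `concurrent.futures`, returning the mapped columns instead of writing into a shared global dataframe. This is a sketch under assumptions, not the answer's own code: `map_columns` and the small example frame are hypothetical, and the GIL caveat above applies to it just as much.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd


def map_columns(df, funcs, max_workers=4):
    """Apply funcs[i] element-wise to the i-th column of df, one task per column."""
    columns = [df.iloc[:, i] for i in range(len(funcs))]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps the input order, so results line up with the columns.
        mapped = list(pool.map(lambda col, f: col.map(f), columns, funcs))
    return pd.concat(mapped, axis=1)


# Small deterministic frame: A = [-3, -1, 1, 3], B = [-2, 0, 2, 4].
df = pd.DataFrame(np.arange(8.0).reshape(4, 2) - 3, columns=["A", "B"])
out = map_columns(df, [lambda x: x < 0, lambda x: x >= 0])
print(out)
```

Collecting the results from `pool.map` avoids having several threads assign columns to the same dataframe at once, and the `with` block shuts the workers down when the work is finished.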

Parallelizing operations in Python

2 answers: