Using multiprocessing vs. multiprocessing.dummy

Date: 2019-02-08 19:05:35

Tags: python pandas dataframe multiprocessing

I have a long-running task that can be parallelized. I debugged the code with multiprocessing.dummy; it works well and I get the expected results. However, when I change it to multiprocessing, the run finishes suspiciously fast, the _test function never does its work, and the actual output is never produced.

The job is to fill a pandas DataFrame with data until a certain row-count threshold is reached. Each pass of a process's long-running while loop adds roughly 2,500 rows. The data acquisition in each process is independent of the others.

The idea is that the processes pass the DataFrame among themselves through a Queue and use a Lock to keep the other processes from touching it in the meantime. When a process finishes its work, it puts the DataFrame back and releases the Lock.
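
In pattern form, the hand-off described here is roughly the following; with_shared_df is an illustrative name, not part of the posted code:

def with_shared_df(output, lock, update):
    # Illustrative helper distilling the hand-off protocol: take the
    # shared DataFrame out of the Queue while holding the Lock, apply
    # an update, and put the result back for the next worker.
    with lock:  # multiprocessing's Lock supports the context-manager protocol
        X = output.get()
        output.put(update(X))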

Once the DataFrame is filled to the required size, the processes can end and the remaining ones are no longer needed to finish (I'm not sure, though, whether they finish cleanly without a join() or what happens to terminate them; perhaps an is_alive() check could replace .join(), as in the sketch below).
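
For reference, a minimal sketch of such a polling loop; wait_for_workers and poll_interval are illustrative names, not from the original code:

import time

def wait_for_workers(processes, poll_interval=1.0):
    # Poll until every worker has exited, then reap them;
    # join() on an already-finished process returns immediately.
    while any(p.is_alive() for p in processes):
        time.sleep(poll_interval)
    for p in processes:
        p.join()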

In this example TRAINING_DATA_LENGTH is set to only 10k, but the real size will be much higher.

The problem: when I change multiprocessing.dummy to multiprocessing, the whole run finishes in 0.7 seconds and the returned X has size 0.


  • Maybe there is another way to do this, but I'm not aware of one yet.

  • I also need it to run from a separate file, not from __main__.


test_mp.py

import pandas as pd
import multiprocessing
from multiprocessing import Process, Queue, Lock
import time
import numpy as np


TRAINING_DATA_LENGTH = 10e3

def get_training_data_mp(testing=False, updating=False):
    s = time.time()
    processes = []
    output = Queue()
    X = pd.DataFrame([])
    output.put(X)
    lock = Lock()

    for i in range(multiprocessing.cpu_count()):           
        p = Process(target=_test,args=(testing,updating,5000,1000,lock,output))
        p.daemon = True
        p.start()
        processes.append(p)

    print([p.is_alive() for p in processes])
#    while all([p.is_alive() for p in processes]):
#        print('alive')    
#        time.sleep(3)            

    for process in processes:
        process.join()               
    print('finished')  

    X = output.get()
    e = time.time()
    print(e-s)
    return X

def _test(testing, updating, max_test_amount, max_train_amount_from_last_days, lock, output):
    time.sleep(2) # short init work

    lock.acquire()   
    X = output.get() 

    while (((not testing or updating) and X.shape[0] < TRAINING_DATA_LENGTH) or
           (testing and X.shape[0] < max_test_amount)):

        if updating and X.shape[0]<max_train_amount_from_last_days:
            output.put(X)
            lock.release()

            time.sleep(2) # long work
            action = '1'
        elif (testing and X.shape[0]<max_test_amount*0.25) and not updating:
            output.put(X)
            lock.release()

            time.sleep(2) # long work
            action = '2'
        else:
            output.put(X)
            lock.release()

            time.sleep(2) # long work
            action = '3'               

        time.sleep(5) # main long work
        x = pd.DataFrame(np.random.randint(0,10000,size=(2500, 4)), columns=list('ABCD')) # simulated result

        lock.acquire()
        X = output.get()
        if X.shape[0] == 0:
            X = x
        else:
            X = X.append(x)  # note: DataFrame.append was removed in pandas 2.0; use pd.concat([X, x]) there

        # correcting output    
        X = X.drop_duplicates(keep='first')
        X.reset_index(drop=True,inplace = True)
        time.sleep(0.5) # short work

    output.put(X)    
    lock.release() 

Then run it from another file:

import test_mp
X = test_mp.get_training_data_mp(True)
print(X.shape[0])

With multiprocessing.dummy I get this output:

[True, True, True, True]
finished
17.01797342300415
12500

With multiprocessing it is:

[True, True, True, True]
finished
0.7530431747436523 # due to time.sleep() it's impossible to finish this fast
0 # expected >= TRAINING_DATA_LENGTH

1 answer:

Answer 0 (score: 0)

Adding if __name__ == '__main__': to the run file makes the code execute and produce "some" result (with the spawn start method, as used on Windows, each child process re-imports the main module, so unguarded top-level code would try to spawn children recursively). But after more testing it seems only one core is actually used (or there is a problem in my code).

import test_mp
if __name__ == '__main__':
    X = test_mp.get_training_data_mp()
    print([len(a) for a in X])  # len() works for both the initial empty lists and the filled numpy arrays

test_mp.py

import multiprocessing
from multiprocessing import Process, Queue, Lock
import time
import numpy as np


TRAINING_DATA_LENGTH = 10e3

def get_training_data_mp(testing=False, updating=False):
    s = time.time()
    processes = []
    output = Queue()
    X = []
    x = [X,X,X,X]
    output.put(x)
    lock = Lock()

    for i in range(multiprocessing.cpu_count()):           
        p = Process(target=_test,args=(i,testing,updating,5000,1000,lock,output))
        p.daemon = True
        p.start()
        processes.append(p)

    while all([p.is_alive() for p in processes]):  
        lock.acquire()
        x = output.get()
        print([len(X) for X in x])
        output.put(x)
        lock.release()
        time.sleep(3)    

    print([p.is_alive() for p in processes])
#    for process in processes:
#        process.join()               
    print('finished') 

    x = output.get()
    my_x = x

    e = time.time()
    print(e-s)
    return my_x

def _test(i, testing, updating, max_test_amount, max_train_amount_from_last_days, lock, output):
    time.sleep(2) # long work

    lock.acquire()   
    x = output.get()
    X = x[i]

    while (((not testing or updating) and len(X) < TRAINING_DATA_LENGTH) or
           (testing and len(X) < max_test_amount)):

        x[i] = X
        output.put(x)
        lock.release()              

        y = np.array(np.random.randint(0,10000,size=(2500, 4)))
        time.sleep(2) # main long work

        lock.acquire()
        x = output.get()  # re-fetch the shared list; the local copy is stale in a separate process
        X = x[i]
        if len(X) == 0:
            X = y
        else:
            X = np.append(X,y,axis=0)   
        # correcting output    
        time.sleep(0.5) # short work
    x[i] = X        
    output.put(x)    
    lock.release()

With multiprocessing.dummy I get this output:

[0, 0, 0, 0]
[5000, 0, 5000, 0]
[False, True, True, True]
finished
7.50442910194397
[10000, 7500, 7500, 2500] # All processes were obtaining data <- intended

With multiprocessing it is:

[0, 0, 0, 0]
[0, 0, 2500, 0]
[0, 10000, 0, 0]
[False, False, True, False]
finished
12.15569543838501
[0, 0, 10000, 0] # Only one process was obtaining data <- wrong

SOLVED

time.sleep() does not actually occupy the processor. After switching it to a CPU-bound function like:

def sleep():
    # CPU-bound busy work: big-integer exponentiation keeps a core fully
    # occupied, unlike time.sleep(), which leaves the process idle.
    n = 0
    for i in range(6000):
        n = i**i

the results from both multiprocessing.dummy and multiprocessing match expectations: both return the same lengths, but multiprocessing is N times faster.
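
As an aside, this hand-rolled Queue/Lock hand-off can often be avoided by letting each worker return its chunk and concatenating in the parent. A minimal sketch under that assumption (produce_chunk, CHUNK_ROWS, and TARGET_ROWS are illustrative names, not from the post, and it presumes the chunks really are independent):

import numpy as np
from multiprocessing import Pool

CHUNK_ROWS = 2500      # rows produced per task, mirroring the original loop
TARGET_ROWS = 10000    # stand-in for TRAINING_DATA_LENGTH

def produce_chunk(_):
    # One pass of the "main long work": build a chunk independently
    # and return it to the parent instead of sharing state.
    return np.random.randint(0, 10000, size=(CHUNK_ROWS, 4))

if __name__ == '__main__':
    with Pool() as pool:
        chunks = pool.map(produce_chunk, range(TARGET_ROWS // CHUNK_ROWS))
    X = np.concatenate(chunks, axis=0)
    print(X.shape[0])  # 10000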