Multiprocessing is not creating any additional processes

Time: 2018-10-22 11:02:16

Tags: python multiprocessing python-multiprocessing

I'm trying to use multiprocessing to speed up my Python program, but no additional processes are actually being created. I've watched a few tutorials, but I'm not getting anywhere.

Here it is:

    cpuutil = int((multiprocessing.cpu_count()) / 2)
    p = Pool(processes = cpuutil)
    output = p.map(OSGBtoETRSfunc(data, eastcol, northcol))
    p.close()
    p.join()
    return output

As I understand it, this should create 2 processes on a quad-core machine, but that isn't happening. My CPU usage stays around 18%...

Any insight? It looks identical to the tutorials I've watched... p.map didn't work when I listed the arguments in square brackets ([]), so I figured it had to use the syntax above?

Thanks

2 Answers:

Answer 0 (score: 1):

In Python, you should supply the function and its arguments separately. If you don't, the function OSGBtoETRSfunc is executed at the moment the processes are created. Instead, you should supply a reference to the function itself, plus a list with its arguments.

Your case is similar to the one shown in the Python documentation: https://docs.python.org/3.7/library/multiprocessing.html#introduction

In any case, I think you are using the wrong function. Pool.map() works like map: given a list of items, it applies the same function to every item. I think your function OSGBtoETRSfunc needs three parameters in order to work properly, so instead of p.map(), use p.apply():

cpuutil = int((multiprocessing.cpu_count()) / 2)
p = Pool(processes = cpuutil)
output = p.apply(OSGBtoETRSfunc, [data, eastcol, northcol])
p.close()
p.join()
return output
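
For illustration, here is a minimal, self-contained sketch of the difference between the two call styles (the worker function add and its arguments are placeholders, not the asker's actual OSGBtoETRSfunc):

    from multiprocessing import Pool

    def add(a, b):
        # placeholder worker function
        return a + b

    if __name__ == '__main__':
        with Pool(processes=2) as p:
            # wrong: add(1, 2) runs immediately in the parent process,
            # and map would receive only its result, not a function
            # p.map(add(1, 2))

            # right: pass the function object plus its arguments
            single = p.apply(add, (1, 2))                    # one call, runs in one worker
            many = p.starmap(add, [(1, 2), (3, 4), (5, 6)])  # several calls spread over workers
        print(single, many)

Note that apply() blocks and runs a single call in a single worker, so by itself it will not use more than one CPU; starmap() (or map()) is what spreads many independent calls across the pool.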

Answer 1 (score: 1):

I'm not sure exactly what you want, so let's start simple. Below is a way to simply call the same function over the rows of a pandas DataFrame:

import pandas as pd
import numpy as np
import os

import pathos
from contextlib import closing

NUM_PROCESSES = os.cpu_count()
# create some data frame 100x4

nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0,100,size=(nrow, ncol)), columns=list('ABCD'))

# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process

# function to be run over rows
# it transforms the given row independently
def foo(idx):
    # extract given row to numpy
    row = df.iloc[[idx]].values[0]
    # you can pass ranges:
    # df[2:3]

    # transform the row (kept as a numpy array, which makes
    # building the new dataframe below straightforward)
    row = np.exp(row)

    # return numpy row
    return row


# run the pool over the range of indexes (0, 1, ..., nrow-1)
# and close it afterwards
# there is no reason here to have more workers than the number of CPUs
with closing(pathos.multiprocessing.Pool(processes=NUM_PROCESSES)) as pool:    
    results = pool.map(foo, range(nrow))

# create new dataframe from all those numpy slices:
col_names = df.columns.values.tolist()
df_new = pd.DataFrame(np.array(results), columns=col_names)
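
Since the original complaint was that no extra processes appear, a quick sanity check (my own small add-on to the example above, not part of the original pipeline) is to have each task report which process it ran in and then count the distinct process ids:

    # optional sanity check: which process handled each task?
    def report_pid(idx):
        return os.getpid()

    with closing(pathos.multiprocessing.Pool(processes=NUM_PROCESSES)) as pool:
        pids = pool.map(report_pid, range(nrow))

    print('distinct worker processes used:', len(set(pids)))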

What more complicated setup does your computation need?

EDIT: OK, here is running two functions simultaneously (I am not very familiar with pandas, so I just switch to numpy):

# RUNNING TWO FUNCTIONS SIMLTANEOUSLY

import pandas as pd
import numpy as np

from multiprocessing import Process, Queue

# create some data frame 100x4

nrow = 100
ncol = 4
df = pd.DataFrame(np.random.randint(0,100,size=(nrow, ncol)), columns=list('ABCD'))

# dataframe resides in global scope
# so it is accessible to processes spawned below
# I pass only row indices to each process

# function to be run over part1 independently
def proc_func1(q1):

    # get data from queue1
    data1 = q1.get()

    # I extract given data to numpy
    data_numpy = data1.values

    # do something
    data_numpy_new = data_numpy + 1

    # return numpy array to queue 1
    q1.put(data_numpy_new)

    return 


# function to be run over part2 independently
def proc_func2(q2):

    # get data from queue2
    data2 = q2.get()


    # I extract given data to numpy
    data_numpy = data2.values

    # do something
    data_numpy_new = data_numpy - 1


    # return numpy array to queue 2
    q2.put(data_numpy_new)

    return


# instantiate queues
q1 = Queue()
q2 = Queue()

# divide data frame into two parts

part1 = df[:50]
part2 = df[50:]

# send the data, so it is already waiting in the queues
q1.put(part1)
q2.put(part2)

# start two processes 
p1 = Process(target=proc_func1, args=(q1,))
p2 = Process(target=proc_func2, args=(q2,))

p1.start()
p2.start()

# wait until they finish
p1.join()
p2.join()


# read results from Queues

res1 = q1.get()
res2 = q2.get()

if (res1 is None) or (res2 is None):
    print('Error!')


# reassemble two results back to single dataframe (might be inefficient)
col_names = df.columns.values.tolist()
# concatenate results along x axis
df_new = pd.DataFrame(np.concatenate([np.array(res1), np.array(res2)], axis=0), columns=col_names)
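
One caveat, taken from the multiprocessing programming guidelines: a child process that has put items on a Queue will not terminate until that data has been flushed to the underlying pipe, so joining the processes before draining the queues can deadlock once the results get large. The 50x4 slices here are small enough that the ordering above works, but a safer ordering is to read first and join afterwards:

    # safer ordering for large results: drain the queues before joining
    res1 = q1.get()
    res2 = q2.get()

    p1.join()
    p2.join()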