Multi-processing code repeatedly runs

Date: 2017-08-09 13:23:26

Tags: python multiprocessing runtime-error

So I want to create a process using Python's multiprocessing module, and I want it to be part of a larger script. (I also want it to do a lot of other things, but for now I'll settle for this.)

I copied the most basic code from the multiprocessing docs and modified it slightly.

However, every time p.join() is called, everything outside the if __name__ == '__main__': statement gets repeated.

Here is my code:

from multiprocessing import Process

data = 'The Data'
print(data)

# worker function definition
def f(p_num):
    print('Doing Process: {}'.format(p_num))

print('start of name == main ')

if __name__ == '__main__':
    print('Creating process')
    p = Process(target=f, args=(data,))
    print('Process made')
    p.start()
    print('process started')
    p.join()
    print('process joined')

print('script finished')

This is what I expected:

The Data
start of name == main 
Creating process
Process made
process started
Doing Process: The Data
process joined
script finished

Process finished with exit code 0

This is what actually happens:

The Data
start of name == main 
Creating process
Process made
process started
The Data                         <- wrongly repeated line
start of name == main            <- wrongly repeated line
script finished                  <- wrongly executed early line
Doing Process: The Data
process joined
script finished

Process finished with exit code 0

I'm not sure whether this is caused by the if statement, by p.join(), or by something else, nor why it happens at all. Can someone explain what causes this, and why?

For clarity, since some people cannot reproduce my problem but I can: I am running Windows Server 2012 R2 Datacenter and Python 3.5.3.

2 answers:

Answer 0: (score: 5)

The way multiprocessing works in Python is that each child process imports the parent script. In Python, when a script is imported, everything not defined inside a function is executed. As I understand it, __name__ is changed when a script is imported (check this SO answer here for a better understanding), as opposed to running the script directly on the command line, which results in __name__ == '__main__'. This import makes __name__ not equal to '__main__', which is why the code inside if __name__ == '__main__': is not executed for your child process.

Anything you don't want executed during the child-process calls should be moved into the if __name__ == '__main__': section of your code, since that section only runs for the parent process, i.e. the script you originally ran.
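Applied to the script in the question, a guarded version would look roughly like this (the worker stays at module level so the child can import it; everything else moves under the guard):

```python
from multiprocessing import Process

# the worker must stay at module level so the child process
# can find it when it re-imports this script
def f(p_num):
    print('Doing Process: {}'.format(p_num))

if __name__ == '__main__':
    # everything here runs only in the parent process
    data = 'The Data'
    print(data)
    p = Process(target=f, args=(data,))
    p.start()
    p.join()
    print('script finished')
```

With this layout, the child's re-import only defines f and then returns, so no lines are printed twice.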

Hope this helps. There are more resources around Google that explain this in greater depth if you look around. I linked the official Python documentation for the multiprocessing module, and I recommend you look through it.

Answer 1: (score: 0)

While exploring this topic I ran into the problem of the module being loaded multiple times. To make it work as described above, I had to:

  • put all imports inside a function (initializer())
  • return everything imported as objects when initializer() is called
  • reference those objects in the definitions and calls of the remaining functions in the module

The example module below runs several classification methods in parallel on the same dataset:

print("I am being run so often because: https://stackoverflow.com/questions/45591987/multi-processing-code-repeatedly-runs")

def initializer():
    from sklearn import datasets

    iris = datasets.load_iris()
    x = iris.data
    y = iris.target    

    from sklearn.preprocessing import StandardScaler as StandardScaler
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import Perceptron
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    import multiprocessing as mp
    from multiprocessing import Manager

    results = []  # for some reason this needs to be defined before the if __name__ == '__main__' block

    return x, y, StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline, mp, Manager, results

def perceptron(x,y,results, StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline):
    scaler = StandardScaler()
    estimator = ["Perceptron", Perceptron(n_iter=40, eta0=0.1, random_state=1)]

    pipe =  Pipeline([('Scaler', scaler),
                      ('Estimator', estimator[1])])

    pipe.fit(x,y)

    y_pred_pipe = pipe.predict(x)
    accuracy = accuracy_score(y, y_pred_pipe)
    result = [estimator[0], estimator[1], pipe, y_pred_pipe, accuracy]
    results.append(result)
    print(estimator[0], "Accuracy: ",accuracy)
    return results

def logistic(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline):
    scaler = StandardScaler()
    estimator = ["LogisticRegression", LogisticRegression(C=100.0, random_state=1)]

    pipe =  Pipeline([('Scaler', scaler),
                      ('Estimator', estimator[1])])

    pipe.fit(x,y)

    y_pred_pipe = pipe.predict(x)
    accuracy = accuracy_score(y, y_pred_pipe)
    result = [estimator[0], estimator[1], pipe, y_pred_pipe, accuracy]
    results.append(result)
    print(estimator[0], "Accuracy: ",accuracy)
    return results

def parallel(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline):
    with Manager() as manager:

        tasks = [perceptron, logistic,]
        results = manager.list() 
        procs = []
        for task in tasks:
            proc = mp.Process(name=task.__name__, target=task, args=(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline))
            procs.append(proc)
            print("done with check 1")
            proc.start()
            print("done with check 2")

        for proc in procs:
            print("done with check 3")
            proc.join()
            print("done with check 4")

        results = list(results)
        print("Within WITH")
        print(results)

    print("Within def")
    print(results)
    return results 

if __name__ == '__main__':
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"

    x, y, StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline, mp, Manager, results = initializer()

    results = parallel(x,y,results,StandardScaler, accuracy_score, Perceptron, LogisticRegression, Pipeline)

    print("Outside of def")
    print(type(results))
    print(len(results))

    print(results[1]) # must be within IF as otherwise does not work ?!?!?!?

    cpu_count = mp.cpu_count()
    print("CPUs: ", cpu_count)