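Question

Please don't be put off by the long post. I try to present as much data as I can, and I really need help with this problem :S. I will update daily if there are new tips or ideas.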

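Problem:

I try to run Python code in parallel on a two-core machine with the help of parallel processes (to avoid the GIL), but the code slows down significantly. For example, a run on a one-core machine takes 600 seconds per workload, but a run on a two-core machine takes 1600 seconds (800 seconds per workload).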

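What I already tried:

I measured the memory, and there appears to be no memory problem. [only 20% in use at the high point]
I used "htop" to check whether I am really running the program on different cores, or whether my core affinity is messed up. No luck there either; my program is running on all of my cores.
The problem is CPU-bound, so I checked and confirmed that my code runs at 100% CPU on all cores most of the time.
I checked the process IDs, and I am indeed spawning two different processes.
I changed the function that I submit to the executor [e.submit(function, [...])] to a calculate-pie function and observed a huge speedup. So the problem most likely lies in my process_function(...), which I submit to the executor, and not in the code before it.
Currently I am using "futures" from "concurrent" to parallelize the tasks. But I also tried the "Pool" class from "multiprocessing". The result remained the same.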

Code:

Spawning the processes:

result = [None]*psutil.cpu_count()

e = futures.ProcessPoolExecutor( max_workers=psutil.cpu_count() )

for i in range(psutil.cpu_count()):
    result[i] = e.submit(process_function, ...)
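The snippet above only stores `Future` objects in `result`; the return values become available through `.result()`. The following is a minimal Python 3 sketch of submitting work and collecting the values. `split_range`, `process_chunk` and `run_parallel` are hypothetical names that do not appear in the post, and `process_chunk` is only a stand-in for a worker that handles part of the outer loop.

```python
from concurrent import futures


def split_range(n_items, n_chunks):
    """Split range(n_items) into n_chunks contiguous (start, stop) pairs."""
    base, extra = divmod(n_items, n_chunks)
    bounds, start = [], 0
    for k in range(n_chunks):
        # The first `extra` chunks get one additional item.
        stop = start + base + (1 if k < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def process_chunk(start, stop):
    """Hypothetical stand-in for a worker that handles rows start..stop-1."""
    return sum(i * i for i in range(start, stop))


def run_parallel(n_items, n_workers):
    """Submit one chunk per worker and gather the partial results in order."""
    with futures.ProcessPoolExecutor(max_workers=n_workers) as executor:
        pending = [executor.submit(process_chunk, a, b)
                   for a, b in split_range(n_items, n_workers)]
        # .result() blocks until the corresponding child has finished.
        return [f.result() for f in pending]
```

A call such as `run_parallel(4000, 2)` would normally sit under an `if __name__ == "__main__":` guard so that child processes can import the module safely.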

process_function:

from math import floor
from math import ceil
import numpy
import MySQLdb
import time

db = MySQLdb.connect(...)
cursor  = db.cursor()
query = "SELECT ...."
cursor.execute(query)

[...]  #save db results into the variable db_matrix (30 columns, 5.000 rows)
[...]  #save db results into the variable bp_vector (3 columns, 500 rows)
[...]  #save db results into the variable option_vector( 3 columns, 4000 rows)

cursor.close()
db.close()

counter = 0 

for i in range(4000):
    for j in range(500):
         helper[:] = ((1-bp_vector[j,0]-bp_vector[j,1]-bp_vector[j,2])*db_matrix[:,0]
                      + db_matrix[:,option_vector[i,0]] * bp_vector[j,0]
                      + db_matrix[:,option_vector[i,1]] * bp_vector[j,1]
                      + db_matrix[:,option_vector[i,2]] * bp_vector[j,2])

         result[counter,0] = (helper < -7.55).sum()

         counter = counter + 1

return result

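My guess:

My guess is that, for some reason, the weighted vector multiplication that creates the vector "helper" is causing the problem. [I believe the Time Measurement section confirms this guess]
Could it be that numpy creates these problems? Is numpy compatible with multiprocessing? If not, what can I do? [Already answered in the comments]
Could it be because of the cache memory? I read about it on the forum, but to be honest, I didn't really understand it. If the problem is rooted there, I will make myself familiar with this topic.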

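Time measurement: (edit)

One core: time to get the data from the database: 8 sec.
Two cores: time to get the data from the database: 12 sec.
One core: time for the double loop in process_function: ~640 sec.
Two cores: time for the double loop in process_function: ~1600 sec.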

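Update: (edit)

When I measure the time of the two processes for every 100 i's in the loop, I see that it is roughly 220% of the time I observe when I measure the same thing while running only one process. But what is even more mysterious: if I quit one process during the run, the other process speeds up! It then actually speeds up to the level it had during the solo run. So there must be some dependency between the processes that I just don't see at the moment :S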
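A per-block measurement like the one described above could be taken along these lines. This is a minimal sketch and not the code used for the numbers in this post; `timed_loop` and its `work` callback are hypothetical names, with `work(i)` standing in for the body of the outer loop.

```python
import os
import time


def timed_loop(n_outer, work, report_every=100):
    """Run work(i) for each i and return the elapsed seconds of every full block."""
    block_times = []
    block_start = time.time()
    for i in range(n_outer):
        work(i)
        if (i + 1) % report_every == 0:
            now = time.time()
            block_times.append(now - block_start)
            # Tag the line with the process ID so that the output of two
            # concurrently running processes can be told apart.
            print("pid %d: i=%d, block took %.3f s"
                  % (os.getpid(), i + 1, block_times[-1]))
            block_start = now
    return block_times
```

Comparing the block times of a solo run with those of two simultaneous runs shows at which point the slowdown sets in.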

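Update-2: (edit)

So, I did a few more test runs and measurements. In the test runs, I used as compute instances either a one-core Linux machine (n1-standard-1, 1 vCPU, 3.75 GB memory) or a two-core Linux machine (n1-standard-2, 2 vCPU, 7.5 GB memory) from Google Cloud Compute Engine. However, I also ran tests on my local computer and observed roughly the same results. (-> therefore, the virtualized environment should be fine.) Here are the results:

P.S.: The times here differ from the measurements above, because I limited the loop a bit and ran the tests on Google Cloud instead of on my home PC.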


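1-core machine, started 1 process:

time: 225 sec, CPU utilization: ~100%

1-core machine, started 2 processes:

time: 557 sec, CPU utilization: ~100%

1-core machine, started 1 process, limited max. CPU utilization to 50%:

time: 488 sec, CPU utilization: ~50%


2-core machine, started 2 processes:

time: 665 sec, CPU-1 utilization: ~100%, CPU-2 utilization: ~100%

the processes did not jump between the cores; each used 1 core

(at least htop displayed these results in the "Process" column)

2-core machine, started 1 process:

time: 222 sec, CPU-1 utilization: ~100% (0%), CPU-2 utilization: ~0% (100%)
     however, the process sometimes jumped between the cores

2-core machine, started 1 process, limited max. CPU utilization to 50%:

time: 493 sec, CPU-1 utilization: ~50% (0%), CPU-2 utilization: ~0% (100%)
     however, the process jumped between the cores very often

I used "htop" and the Python module "time" to obtain these results.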

Update-3: (edit)

I used cProfile to profile my code:

python -m cProfile -s cumtime fun_name.py

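The files are too long to post here, but I believe that if they contain valuable information at all, it is probably the information at the top of the result text. Therefore, I will post the first lines of the results here:

1-core machine, started 1 process: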

         623158 function calls (622735 primitive calls) in 229.286 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.371    0.371  229.287  229.287 20_with_multiprocessing.py:1(<module>)
        3    0.000    0.000  225.082   75.027 threading.py:309(wait)
        1    0.000    0.000  225.082  225.082 _base.py:378(result)
       25  225.082    9.003  225.082    9.003 {method 'acquire' of 'thread.lock' objects}
        1    0.598    0.598    3.081    3.081 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
        3    0.000    0.000    2.877    0.959 cursors.py:164(execute)
        3    0.000    0.000    2.877    0.959 cursors.py:353(_query)
        3    0.000    0.000    1.958    0.653 cursors.py:315(_do_query)
        3    0.000    0.000    1.943    0.648 cursors.py:142(_do_get_result)
        3    0.000    0.000    1.943    0.648 cursors.py:351(_get_result)
        3    1.943    0.648    1.943    0.648 {method 'store_result' of '_mysql.connection' objects}
        3    0.001    0.000    0.919    0.306 cursors.py:358(_post_get_result)
        3    0.000    0.000    0.917    0.306 cursors.py:324(_fetch_row)
        3    0.917    0.306    0.917    0.306 {built-in method fetch_row}
   591314    0.161    0.000    0.161    0.000 {range}

1-core machine, started 2 processes:

         626052 function calls (625616 primitive calls) in 578.086 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.310    0.310  578.087  578.087 20_with_multiprocessing.py:1(<module>)
       30  574.310   19.144  574.310   19.144 {method 'acquire' of 'thread.lock' objects}
        2    0.000    0.000  574.310  287.155 _base.py:378(result)
        3    0.000    0.000  574.310  191.437 threading.py:309(wait)
        1    0.544    0.544    2.854    2.854 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
        3    0.000    0.000    2.563    0.854 cursors.py:164(execute)
        3    0.000    0.000    2.563    0.854 cursors.py:353(_query)
        3    0.000    0.000    1.715    0.572 cursors.py:315(_do_query)
        3    0.000    0.000    1.701    0.567 cursors.py:142(_do_get_result)
        3    0.000    0.000    1.701    0.567 cursors.py:351(_get_result)
        3    1.701    0.567    1.701    0.567 {method 'store_result' of '_mysql.connection' objects}
        3    0.001    0.000    0.848    0.283 cursors.py:358(_post_get_result)
        3    0.000    0.000    0.847    0.282 cursors.py:324(_fetch_row)
        3    0.847    0.282    0.847    0.282 {built-in method fetch_row}
   591343    0.152    0.000    0.152    0.000 {range}

2-core machine, started 1 process:

         623164 function calls (622741 primitive calls) in 235.954 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.246    0.246  235.955  235.955 20_with_multiprocessing.py:1(<module>)
        3    0.000    0.000  232.003   77.334 threading.py:309(wait)
       25  232.003    9.280  232.003    9.280 {method 'acquire' of 'thread.lock' objects}
        1    0.000    0.000  232.003  232.003 _base.py:378(result)
        1    0.593    0.593    3.104    3.104 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
        3    0.000    0.000    2.774    0.925 cursors.py:164(execute)
        3    0.000    0.000    2.774    0.925 cursors.py:353(_query)
        3    0.000    0.000    1.981    0.660 cursors.py:315(_do_query)
        3    0.000    0.000    1.970    0.657 cursors.py:142(_do_get_result)
        3    0.000    0.000    1.969    0.656 cursors.py:351(_get_result)
        3    1.969    0.656    1.969    0.656 {method 'store_result' of '_mysql.connection' objects}
        3    0.001    0.000    0.794    0.265 cursors.py:358(_post_get_result)
        3    0.000    0.000    0.792    0.264 cursors.py:324(_fetch_row)
        3    0.792    0.264    0.792    0.264 {built-in method fetch_row}
   591314    0.144    0.000    0.144    0.000 {range}

2-core machine, started 2 processes:

         626072 function calls (625636 primitive calls) in 682.460 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.334    0.334  682.461  682.461 20_with_multiprocessing.py:1(<module>)
        4    0.000    0.000  678.231  169.558 threading.py:309(wait)
       33  678.230   20.552  678.230   20.552 {method 'acquire' of 'thread.lock' objects}
        2    0.000    0.000  678.230  339.115 _base.py:378(result)
        1    0.527    0.527    2.974    2.974 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
        3    0.000    0.000    2.723    0.908 cursors.py:164(execute)
        3    0.000    0.000    2.723    0.908 cursors.py:353(_query)
        3    0.000    0.000    1.749    0.583 cursors.py:315(_do_query)
        3    0.000    0.000    1.736    0.579 cursors.py:142(_do_get_result)
        3    0.000    0.000    1.736    0.579 cursors.py:351(_get_result)
        3    1.736    0.579    1.736    0.579 {method 'store_result' of '_mysql.connection' objects}
        3    0.001    0.000    0.975    0.325 cursors.py:358(_post_get_result)
        3    0.000    0.000    0.973    0.324 cursors.py:324(_fetch_row)
        3    0.973    0.324    0.973    0.324 {built-in method fetch_row}
        5    0.093    0.019    0.304    0.061 __init__.py:1(<module>)
        1    0.017    0.017    0.275    0.275 __init__.py:106(<module>)
        1    0.005    0.005    0.198    0.198 add_newdocs.py:10(<module>)
   591343    0.148    0.000    0.148    0.000 {range}

Personally, I don't really know what to make of these results. I would be glad to receive tips, hints or any other help. Thanks :)
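In the multiprocessing profiles above, almost all of the time is attributed to `{method 'acquire' of 'thread.lock' objects}` below `_base.py:378(result)`. That is the parent process blocking in `Future.result()` while the children compute, so the children's own hot spots are not visible there. One way to see them is to run the profiler inside each worker. The following is a minimal Python 3 sketch; `run_profiled` and `busy_sum` are hypothetical names, and `busy_sum` merely stands in for `process_function`.

```python
import cProfile
import io
import pstats


def run_profiled(func, *args):
    """Call func(*args) under cProfile in the current process.

    Returns the function's result together with a text report of the
    ten entries with the highest cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


def busy_sum(n):
    """Hypothetical CPU-bound stand-in for process_function."""
    total = 0
    for i in range(n):
        total += i * i
    return total
```

Submitting `executor.submit(run_profiled, busy_sum, 1000)` instead of the bare function would return the report from each child alongside its result, since both `run_profiled` and the worker are module-level functions and can be pickled.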

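Reply to Answer-1: (edit)

Roland Smith looked at the data and suggested that multiprocessing might hurt performance more than it helps. Therefore, I did one more measurement without multiprocessing (like the code he suggested):

Am I right to conclude that this is not the case? The measured times appear to be similar to the times measured before with multiprocessing.

1-core machine:


Database access took 2.53 seconds

Matrix manipulation took 236.71 seconds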

         1842384 function calls (1841974 primitive calls) in 241.114 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  219.036  219.036  241.115  241.115 20_with_multiprocessing.py:1(<module>)
   406000    0.873    0.000   18.097    0.000 {method 'sum' of 'numpy.ndarray' objects}
   406000    0.502    0.000   17.224    0.000 _methods.py:31(_sum)
   406001   16.722    0.000   16.722    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.587    0.587    3.222    3.222 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
        3    0.000    0.000    2.964    0.988 cursors.py:164(execute)
        3    0.000    0.000    2.964    0.988 cursors.py:353(_query)
        3    0.000    0.000    1.958    0.653 cursors.py:315(_do_query)
        3    0.000    0.000    1.944    0.648 cursors.py:142(_do_get_result)
        3    0.000    0.000    1.944    0.648 cursors.py:351(_get_result)
        3    1.944    0.648    1.944    0.648 {method 'store_result' of '_mysql.connection' objects}
        3    0.001    0.000    1.006    0.335 cursors.py:358(_post_get_result)
        3    0.000    0.000    1.005    0.335 cursors.py:324(_fetch_row)
        3    1.005    0.335    1.005    0.335 {built-in method fetch_row}
   591285    0.158    0.000    0.158    0.000 {range}

2-core machine:


Database access took 2.32 seconds

Matrix manipulation took 242.45 seconds

         1842390 function calls (1841980 primitive calls) in 246.535 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1  224.705  224.705  246.536  246.536 20_with_multiprocessing.py:1(<module>)
   406000    0.911    0.000   17.971    0.000 {method 'sum' of 'numpy.ndarray' objects}
   406000    0.526    0.000   17.060    0.000 _methods.py:31(_sum)
   406001   16.534    0.000   16.534    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.617    0.617    3.113    3.113 get_BP_Verteilung_Vektoren.py:1(get_BP_Verteilung_Vektoren)
        3    0.000    0.000    2.789    0.930 cursors.py:164(execute)
        3    0.000    0.000    2.789    0.930 cursors.py:353(_query)
        3    0.000    0.000    1.938    0.646 cursors.py:315(_do_query)
        3    0.000    0.000    1.920    0.640 cursors.py:142(_do_get_result)
        3    0.000    0.000    1.920    0.640 cursors.py:351(_get_result)
        3    1.920    0.640    1.920    0.640 {method 'store_result' of '_mysql.connection' objects}
        3    0.001    0.000    0.851    0.284 cursors.py:358(_post_get_result)
        3    0.000    0.000    0.849    0.283 cursors.py:324(_fetch_row)
        3    0.849    0.283    0.849    0.283 {built-in method fetch_row}
   591285    0.160    0.000    0.160    0.000 {range}

Answer 1

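Your programs seem to spend most of their time acquiring locks. That seems to indicate that in your case multiprocessing hurts more than it helps.

Remove all the multiprocessing stuff and start measuring how long things take without it, e.g. like this.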

from math import floor
from math import ceil
import numpy
import MySQLdb
import time

start = time.clock()
db = MySQLdb.connect(...)
cursor  = db.cursor()
query = "SELECT ...."
cursor.execute(query)
stop = time.clock()
print "Database access took {:.2f} seconds".format(stop - start)

start = time.clock()
[...]  #save db results into the variable db_matrix (30 columns, 5.000 rows)
[...]  #save db results into the variable bp_vector (3 columns, 500 rows)
[...]  #save db results into the variable option_vector( 3 columns, 4000 rows)
stop = time.clock()
print "Creating matrices took {:.2f} seconds".format(stop - start)
cursor.close()
db.close()

counter = 0 

start = time.clock()
for i in range(4000):
    for j in range(500):
         helper[:] = ((1-bp_vector[j,0]-bp_vector[j,1]-bp_vector[j,2])*db_matrix[:,0]
                      + db_matrix[:,option_vector[i,0]] * bp_vector[j,0]
                      + db_matrix[:,option_vector[i,1]] * bp_vector[j,1]
                      + db_matrix[:,option_vector[i,2]] * bp_vector[j,2])

         result[counter,0] = (helper < -7.55).sum()

         counter = counter + 1
stop = time.clock()
print "Matrix manipulation took {:.2f} seconds".format(stop - start)
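The timing code above is Python 2 and relies on `time.clock()`, which on Unix reported processor time rather than wall-clock time and was removed in Python 3.8. For readers on Python 3, the following is a minimal sketch of the same kind of measurement that records both wall-clock and CPU time. The `timed` helper is a hypothetical name and not part of the answer.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, log):
    """Append (label, wall_seconds, cpu_seconds) to log when the block exits."""
    wall_start = time.perf_counter()   # wall-clock time
    cpu_start = time.process_time()    # CPU time of this process only
    try:
        yield
    finally:
        log.append((label,
                    time.perf_counter() - wall_start,
                    time.process_time() - cpu_start))
```

Wrapping the double loop as `with timed("matrix", log): ...` makes a gap between the two numbers visible, which would point at waiting rather than computing.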

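Edit-1

Based on your measurements, I stand by my conclusion (in a slightly rephrased form) that on a multi-core machine, using multiprocessing the way you do now hurts your performance very much. On a dual-core machine, the program with multiprocessing takes longer than the one without it!

I don't think it makes a difference on a single-core machine whether you use multiprocessing or not. A single-core machine will not benefit much from multiprocessing anyway.

The new measurements show that most of the time is spent in the matrix manipulation. This is logical, since you are using an explicit nested for-loop, which is not very fast.

There are basically four possible solutions:

The first is to rewrite the nested loop as numpy operations. Numpy operations have implicit loops (written in C) instead of explicit loops in Python and are therefore faster. (A rare case where explicit is worse than implicit. ;-)) The downside is that this may use a significant amount of memory.

The second option is to split up the calculation of helper, which consists of 4 parts. Perform each part in a separate process and add the results together at the end. This does incur some overhead: every process has to retrieve all the data from the database and has to transfer its partial result back to the main process (maybe also via the database?).

The third option might be to use pypy instead of CPython. It can be significantly faster.

A fourth option would be to rewrite the critical matrix manipulation in Cython or C.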
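The first option could look roughly like the following. This is a minimal sketch, assuming that db_matrix and bp_vector are float arrays and that option_vector holds integer column indices; the function names are hypothetical, and the result is returned as a flat array rather than written to `result[counter,0]`. It keeps the loop over i but replaces the inner loop over j with one matrix product, so for the sizes in the question the intermediate `helper` holds 5000 x 500 doubles (about 20 MB) per i.

```python
import numpy as np


def count_below_loop(db_matrix, bp_vector, option_vector, threshold=-7.55):
    """Reference version: the explicit double loop from the question."""
    n_opt, n_bp = len(option_vector), len(bp_vector)
    result = np.zeros(n_opt * n_bp, dtype=np.int64)
    counter = 0
    for i in range(n_opt):
        for j in range(n_bp):
            helper = ((1 - bp_vector[j, 0] - bp_vector[j, 1] - bp_vector[j, 2])
                      * db_matrix[:, 0]
                      + db_matrix[:, option_vector[i, 0]] * bp_vector[j, 0]
                      + db_matrix[:, option_vector[i, 1]] * bp_vector[j, 1]
                      + db_matrix[:, option_vector[i, 2]] * bp_vector[j, 2])
            result[counter] = (helper < threshold).sum()
            counter += 1
    return result


def count_below_vectorized(db_matrix, bp_vector, option_vector, threshold=-7.55):
    """Same counts, with the inner loop over j expressed as a matrix product."""
    # One row of weights per j: [1 - sum(bp), bp0, bp1, bp2], shape (n_bp, 4).
    weights = np.column_stack((1.0 - bp_vector.sum(axis=1), bp_vector))
    result = np.empty((len(option_vector), len(bp_vector)), dtype=np.int64)
    for i in range(len(option_vector)):
        # Column 0 plus the three option columns selected for this i.
        cols = np.concatenate(([0], option_vector[i]))
        # helper[:, j] is the weighted sum for the pair (i, j).
        helper = np.dot(db_matrix[:, cols], weights.T)
        result[i] = (helper < threshold).sum(axis=0)
    # Row-major flattening matches the counter order of the double loop.
    return result.ravel()
```

Because the matrix product may add the four terms in a different order than the loop, values that lie extremely close to the threshold could in principle be counted differently; it is worth comparing both versions on real data before switching.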

Multiprocessing in Python: Numpy + Vector Summation -> Huge Slowdown
