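So I've been playing around with the multiprocessing module, trying to find ways to speed up a lot of the work I do with pandas DataFrames.
The example I'm working with takes a series of Excel files, each representing a year of data, converts them into DataFrames, and then sums one of the columns. Something like this: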
now = time.time()
dict = {}
table_2010 = pd.read_excel('2010.xlsx')
table_2011 = pd.read_excel('2011.xlsx')
table_2012 = pd.read_excel('2012.xlsx')
table_2013 = pd.read_excel('2013.xlsx')
table_2014 = pd.read_excel('2014.xlsx')
table_2015 = pd.read_excel('2015.xlsx')
dict[2011] = table_2011[[95]].sum()
dict[2010] = table_2010[[95]].sum()
dict[2012] = table_2012[[95]].sum()
dict[2013] = table_2013[[95]].sum()
dict[2014] = table_2014[[95]].sum()
dict[2015] = table_2015[[95]].sum()
print dict
print time.time() - now
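This took 205 seconds. The Excel files are fairly large and take a while to load into DataFrames, so I thought running the loads in parallel would improve performance. This is what I came up with: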
def func(year):
    table = pd.read_excel(str(year) + '.xlsx')
    dict[year] = table[[95]].sum()
if __name__ == '__main__':
    now = time.time()
    dict = {}
    pool = ThreadPool(8)
    pool.map_async(func, [2010,2011,2012,2013,2014,2015])
    pool.close()
    pool.join()
    print dict
    print time.time() - now
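When I run it, it ends up taking 250 seconds. My impression was that letting each process run on a separate core would improve performance. Is that incorrect?
Or is there something wrong with the script I wrote?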
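Answer 0 (score: -1)
Slower?
It depends.
It depends, a lot.
Is there a problem with the script?
Yes, a crude one (still, no need to worry or panic: it is a solvable problem). Enjoy the read.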
# =========================================================================[sec]
an-<iterator>-based SERIAL processing of 9 CPU-bound tasks took 1290.538 [sec]
aThreadPool(6)-based TPOOL processing of 9 CPU-bound tasks took 1212.065 [sec]
aPool(6)-based POOL processing of 9 CPU-bound tasks took 271.765 [sec]
# =========================================================================[sec]
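multiprocessing has several Pool-s
Based on the not fully documented MCVE above (it lacks all the explicit namespace import-s that would safely disambiguate the setup of the intended use case), let's start from what the code does mention: ThreadPool.map_async() and the processing of many Excel files.
It is hard to pick a worse approach to start with when fast processing is the goal.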
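One possible repair of the question's script, sketched under assumptions: moving from ThreadPool to a process-based Pool means the workers no longer share the parent's global dict, so each worker returns its result and the parent assembles the dictionary. `load_year` is a hypothetical stand-in for `pd.read_excel(str(year) + '.xlsx')[95]` so that the sketch runs without the Excel files; the function names and the pool size are illustrative and not taken from the question.

```python
from multiprocessing import Pool


def load_year(year):
    # Hypothetical stand-in for pd.read_excel(str(year) + '.xlsx')[95]:
    # a tiny synthetic column, so the sketch runs without the Excel files.
    return [year + offset for offset in range(5)]


def sum_year(year):
    # Executed inside a worker process. The result is returned rather than
    # written to a global dict, which the parent process would never see.
    return year, sum(load_year(year))


def sums_by_year(years, processes=6):
    # Pool.map() blocks until every task has finished and keeps input order.
    with Pool(processes) as pool:
        return dict(pool.map(sum_year, years))


if __name__ == "__main__":
    print(sums_by_year([2010, 2011, 2012, 2013, 2014, 2015]))
```

Whether this actually beats the serial loop on real workbooks depends on how much of the load time is parsing (CPU) versus reading from disk, so it is worth timing both variants on the real files.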
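Pool slower than SEQ?
Borrowing deliberately from the natively parallel occam language, this question tends to lead to the pain of the PAR | SEQ dilemma faced when designing high-performance systems (yes, HPC, of course: guess who would ever want to design a slow system on purpose, right?).
The problem is multi-faceted, and several more questions have to be asked before the initial dilemma can be addressed seriously.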
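What resources do we have?
What type of operations get executed in the PAR | SEQ arrangement:
is the problem purely { CPU-bound | IO-bound }?
does the problem need to share { state | data } during processing?
does the problem need to communicate { signals | messages } during processing?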
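A rough way to approach the CPU-bound versus IO-bound question for a given workload is to compare the CPU time a call consumes with the wall-clock time it takes: a ratio near 1 suggests the call is mostly computing, a ratio near 0 suggests it is mostly waiting. The sketch below is only an illustration; `cpu_fraction`, `busy` and `waiting` are hypothetical names, and the two workloads are synthetic stand-ins.

```python
import time


def cpu_fraction(func, *args):
    """Return CPU time divided by wall-clock time for one call of func."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    func(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used


def busy(n):
    # CPU-bound stand-in: pure arithmetic, no waiting.
    total = 0
    for i in range(n):
        total += i * i
    return total


def waiting(seconds):
    # IO-bound stand-in: the process sleeps, as it would while waiting on a device.
    time.sleep(seconds)


if __name__ == "__main__":
    print("busy   :", round(cpu_fraction(busy, 2_000_000), 2))
    print("waiting:", round(cpu_fraction(waiting, 0.2), 2))
```

Note that `time.process_time()` only counts CPU time of the calling process, so this measurement should wrap the serial version of the work, not a pool of child processes.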
CPU-bound processing is much simpler to mock up (and also "a way to wear and tear precious physical HPC resources"), so let's start with a raw function:
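def aFatCALCULUS( id ):                                  # an INTENSIVE CPU-bound WORKLOAD
    import math                                          # stdlib factorial (the np.math alias is gone in NumPy 2.x)
    import os
    import sys
    if hasattr( sys, "set_int_max_str_digits" ):         # newer Pythons cap int -> str conversion by default
        sys.set_int_max_str_digits( 0 )                  # 0 lifts the cap, the result has millions of digits
    pass;   aST = "aFatCALCULUS( {1:>3d} ) [PID:: {0:d}] RET'd {2:d}"
    return( aST.format( os.getpid(),
                        id,
                        id + len( str( [ math.factorial( 2**f ) for f in range( 20 ) ][-1] ) )
                        )
            )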
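Now, let's execute it a few times in different ways.
Forgive my non-PEP-8 formatting (nobody is sponsoring any core refactoring of the demonstration made here, so no one should consider this choice inappropriate in any sense).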
from multiprocessing.pool import ThreadPool # ThreadPool-mode
from multiprocessing import Pool # Pool-mode
pass; import time
print( "{0:}----------------------------------------------------------- # SETUP:".format( time.ctime() ) )
aListOfTaskIdNUMBERs = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, ]
print( "{0:}----------------------------------------------------------- # PROCESSING-ThreadPool mode of EXECUTION:".format( time.ctime() ) )
aTPool = ThreadPool( 6 ) # PROCESSING-ThreadPool.capacity == 6
print( "{0:}----------------------------------------------------------- # SERIAL mode of EXECUTION:".format( time.ctime() ) )
start = time.clock_gettime( time.CLOCK_MONOTONIC_RAW );
pass; [ aFatCALCULUS( id ) for id in aListOfTaskIdNUMBERs ] # SERIAL <iterator>-driven mode of EXECUTION
pass; duration = time.clock_gettime( time.CLOCK_MONOTONIC_RAW ) - start; print( "an-<iterator>-based SERIAL processing of {1:}-tasks took {0:} [sec]".format( duration, len( aListOfTaskIdNUMBERs ) ) )
pass;
print( "{0:}----------------------------------------------------------- # PROCESSING-Pool mode of EXECUTION:".format( time.ctime() ) )
aPool = Pool( 6 ) # PROCESSING-Pool.capacity == 6
start = time.clock_gettime( time.CLOCK_MONOTONIC_RAW );
pass; aPool.map( aFatCALCULUS, aListOfTaskIdNUMBERs ) # PROCESSING-Pool-driven mode of EXECUTION
pass; duration = time.clock_gettime( time.CLOCK_MONOTONIC_RAW ) - start; print( "aPool(6)-based processing of {1:}-tasks took {0:} [sec]".format( duration, len( aListOfTaskIdNUMBERs ) ) )
print( "{0:}----------------------------------------------------------- # END.".format( time.ctime() ) )
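The listing above creates aTPool but only times the serial and the Pool runs, although the results at the top of the answer also report a TPOOL figure. A hedged sketch of how a ThreadPool run might be timed in the same pattern follows; `small_calculus` is a hypothetical, scaled-down stand-in for aFatCALCULUS() so that the comparison finishes quickly, and `timed_map` is an illustrative helper, not part of the original code.

```python
import math
import time
from multiprocessing.pool import ThreadPool


def small_calculus(task_id):
    # Scaled-down, hypothetical stand-in for aFatCALCULUS(): still CPU-bound,
    # but it finishes in a fraction of a second.
    return task_id + len(str(math.factorial(500)))


def timed_map(mapper, func, tasks):
    """Run mapper(func, tasks) and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = list(mapper(func, tasks))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    tasks = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    _, serial_s = timed_map(map, small_calculus, tasks)
    with ThreadPool(6) as tpool:
        _, tpool_s = timed_map(tpool.map, small_calculus, tasks)
    print("SERIAL took {0:.6f} [sec]".format(serial_s))
    print("TPOOL  took {0:.6f} [sec]".format(tpool_s))
```

With a workload this small the two figures are dominated by overhead; the point is only the measurement pattern, which can be pointed at the full-size function.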
PID #s under aPool(6).map():
["aFatCALCULUS( 1 ) [PID:: 898] RET'd 2771011",
"aFatCALCULUS( 2 ) [PID:: 899] RET'd 2771012",
"aFatCALCULUS( 3 ) [PID:: 900] RET'd 2771013",
"aFatCALCULUS( 4 ) [PID:: 901] RET'd 2771014",
"aFatCALCULUS( 5 ) [PID:: 902] RET'd 2771015",
"aFatCALCULUS( 6 ) [PID:: 903] RET'd 2771016",
"aFatCALCULUS( 7 ) [PID:: 898] RET'd 2771017",
"aFatCALCULUS( 8 ) [PID:: 899] RET'd 2771018",
"aFatCALCULUS( 9 ) [PID:: 903] RET'd 2771019"
]
aThreadPool(6)
["aFatCALCULUS( 1 ) [PID:: 16125] RET'd 2771011",
"aFatCALCULUS( 2 ) [PID:: 16125] RET'd 2771012",
"aFatCALCULUS( 3 ) [PID:: 16125] RET'd 2771013",
"aFatCALCULUS( 4 ) [PID:: 16125] RET'd 2771014",
"aFatCALCULUS( 5 ) [PID:: 16125] RET'd 2771015",
"aFatCALCULUS( 6 ) [PID:: 16125] RET'd 2771016",
"aFatCALCULUS( 7 ) [PID:: 16125] RET'd 2771017",
"aFatCALCULUS( 8 ) [PID:: 16125] RET'd 2771018",
"aFatCALCULUS( 9 ) [PID:: 16125] RET'd 2771019"
]
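The two PID listings above are the key observation: every ThreadPool task reports the same PID, while the Pool tasks are spread over several worker processes. In standard CPython builds, threads of one process share the global interpreter lock, so CPU-bound Python code in a ThreadPool still executes on one core at a time. A minimal sketch that reproduces the PID check, with illustrative names (`worker_pid`, `distinct_pids`) that are not from the original answer:

```python
import os
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool


def worker_pid(_task_id):
    # Report which operating-system process executed this task.
    return os.getpid()


def distinct_pids(pool, tasks):
    # Collapse the per-task PIDs into the set of processes that did the work.
    return set(pool.map(worker_pid, tasks))


if __name__ == "__main__":
    tasks = list(range(9))
    with ThreadPool(6) as tpool:
        # A single PID: all threads live inside the parent process.
        print("ThreadPool PIDs:", distinct_pids(tpool, tasks))
    with Pool(6) as ppool:
        # PIDs of separate worker processes, none of them the parent's.
        print("Pool PIDs      :", distinct_pids(ppool, tasks))
```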
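Supercomputers turn compute-bound problems into I/O-bound problems (S. Cray)
Heed Seymour CRAY's wisdom dutifully, but
do not let others turn you into the one who pays for their neglected HPC duties
out of your CPU budget.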
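IMHO, if this were my HPC task, I would:
avoid paying the pandas XLSX import / conversion costs
make the owner / processor of the Excel data the guarantor of an automatic column SUM(), maintained on the data-store side by { auto | manual | scripted } updates that arrive in time for every change / update of a data element, whether batch- or event-driven
look for the fastest (distributed-processing) architecture, in which the independent multiprocessing.Pool().map() processes do not read everything (== move whole heaps of data) but use smart, direct access to only those cells (== elements) that need to be processed.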
The PAR-arranged Pool() processing was faster than any SEQ processing:
''' REAL SYSTEM:: multiprocessing.Pool(6).map()
_________________________________________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________________________________________
top - 22:24:42 up 84 days, 23:05, 4 users, load average: 4.80, 2.17, 0.86
Threads: 366 total, 5 running, 361 sleeping, 0 stopped, 0 zombie
%Cpu0 : 75.7/0.0 76[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu1 : 0.1/0.0 0[ ]
%Cpu2 : 0.0/0.0 0[ ]
%Cpu3 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu4 : 0.1/0.0 0[ ]
%Cpu5 : 0.0/0.0 0[ ]
%Cpu6 : 76.2/0.0 76[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu7 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu8 : 0.0/0.0 0[ ]
%Cpu9 : 75.5/0.0 76[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
%Cpu10 : 0.5/0.4 1[ ]
%Cpu11 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu12 : 0.0/0.0 0[ ]
%Cpu13 : 0.0/0.0 0[ ]
%Cpu14 : 0.0/0.0 0[ ]
%Cpu15 : 0.7/0.5 1[|| ]
%Cpu16 : 0.0/0.0 0[ ]
%Cpu17 : 0.0/0.0 0[ ]
%Cpu18 : 0.0/0.0 0[ ]
%Cpu19 : 0.0/0.0 0[ ]
KiB Mem : 24522940 total, 22070528 free, 778080 used, 1674332 buff/cache
KiB Swap: 8257532 total, 7419136 free, 838396 used. 22905264 avail Mem
P S %CPU PPID PID nTH TIME+ USER PR NI RES CODE SHR DATA %MEM VIRT vMj vMn SWAP nsIPC COMMAND
1 S 0.0 1614 1670 1 10:54.15 root 20 0 632 740 416 1396 0.0 52140 0 0 1172 - `- haproxy
2 S 0.0 1614 1671 1 35:40.50 root 20 0 664 740 380 1528 0.0 52272 0 0 1172 - `- haproxy
19 S 0.0 1 1658 1 6:20.42 root 20 0 22960 468 14380 14240 0.1 466344 0 0 836 - `- httpd
12 S 0.0 1 24217 1 4:31.41 root 20 0 3984 8 668 7320 0.0 155304 0 0 3864 - `- munin-node
0 R 0.0 12882 4964 1 0:31.53 m 20 0 2596 96 1524 1596 0.0 158096 0 0 0 4026531839 `- top
0 S 0.1 15213 22779 22 0:11.16 m 20 0 54052 2268 5816 1965528 0.2 2191768 0 0 0 4026531839 `- python3
1 S 0.1 15213 23613 22 0:10.83 m 20 0 54052 2268 5816 1965528 0.2 2191768 0 0 0 4026531839 `- python3
7 R 99.9 16125 898 1 2:29.72 m 20 0 52084 2268 1336 1969112 0.2 2195352 0 3k 0 4026531839 `- python3
11 R 99.9 16125 899 1 2:29.72 m 20 0 52088 2268 1336 1969116 0.2 2195356 0 3k 0 4026531839 `- python3
6 S 76.3 16125 900 1 2:15.49 m 20 0 49724 2268 1236 1965520 0.2 2191760 0 777 0 4026531839 `- python3
0 S 75.7 16125 901 1 2:15.12 m 20 0 49732 2268 1236 1965524 0.2 2191764 0 775 0 4026531839 `- python3
9 S 75.6 16125 902 1 2:15.05 m 20 0 49732 2268 1236 1965524 0.2 2191764 0 775 0 4026531839 `- python3
3 R 99.9 16125 903 1 2:29.70 m 20 0 52100 2268 1336 1969120 0.2 2195360 0 3k 0 4026531839 `- python3
4 S 0.1 15213 904 22 0:00.36 m 20 0 54052 2268 5816 1965528 0.2 2191768 0 0 0 4026531839 `- python3
15 S 1.2 19285 21279 2 21:27.31 a 20 0 75720 2268 12940 196868 0.3 642876 0 0 0 - `- python3
8 S 0.0 19285 21281 2 0:14.88 a 20 0 75720 2268 12940 196868 0.3 642876 0 0 0 - `- python3
10 S 0.9 22118 22120 2 20:07.34 a 20 0 56604 2268 7176 464808 0.2 722164 0 0 0 - `- python3
4 S 0.0 22118 22122 2 0:19.39 a 20 0 56604 2268 7176 464808 0.2 722164 0 0 0 - `- python3
4 S 0.0 2 29 1 33:46.57 root 20 0 0 0 0 0 0.0 0 0 0 0 - `- rcu_sched
_________________________________________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________________________________________
top - 22:25:31 up 84 days, 23:06, 4 users, load average: 3.78, 2.30, 0.97
Threads: 365 total, 4 running, 361 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.2/0.4 1[ ]
%Cpu1 : 0.0/0.0 0[ ]
%Cpu2 : 0.0/0.0 0[ ]
%Cpu3 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu4 : 0.2/0.0 0[ ]
%Cpu5 : 0.0/0.0 0[ ]
%Cpu6 : 0.0/0.0 0[ ]
%Cpu7 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu8 : 0.0/0.0 0[ ]
%Cpu9 : 0.0/0.0 0[ ]
%Cpu10 : 0.6/0.4 1[| ]
%Cpu11 : 100.0/0.0 100[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
%Cpu12 : 0.0/0.0 0[ ]
%Cpu13 : 0.0/0.0 0[ ]
%Cpu14 : 0.0/0.0 0[ ]
%Cpu15 : 0.6/0.6 1[|| ]
%Cpu16 : 0.0/0.0 0[ ]
%Cpu17 : 0.0/0.0 0[ ]
%Cpu18 : 0.0/0.0 0[ ]
%Cpu19 : 0.0/0.0 0[ ]
KiB Mem : 24522940 total, 22076660 free, 772436 used, 1673844 buff/cache
KiB Swap: 8257532 total, 7419136 free, 838396 used. 22911364 avail Mem
P S %CPU PPID PID nTH TIME+ USER PR NI RES CODE SHR DATA %MEM VIRT vMj vMn SWAP nsIPC COMMAND
2 S 0.2 1614 1671 1 35:40.51 root 20 0 664 740 380 1528 0.0 52272 0 0 1172 - `- haproxy
0 R 0.4 12882 4964 1 0:31.66 m 20 0 2596 96 1524 1596 0.0 158096 0 0 0 4026531839 `- top
7 R 99.9 16125 898 1 3:18.45 m 20 0 52608 2268 1336 1969112 0.2 2195352 0 9 0 4026531839 `- python3
11 R 99.9 16125 899 1 3:18.46 m 20 0 52612 2268 1336 1969116 0.2 2195356 0 9 0 4026531839 `- python3
3 R 99.9 16125 903 1 3:18.43 m 20 0 52624 2268 1336 1969120 0.2 2195360 0 10 0 4026531839 `- python3
15 S 1.2 19285 21279 2 21:27.92 a 20 0 75720 2268 12940 196868 0.3 642876 0 0 0 - `- python3
10 S 1.0 22118 22120 2 20:07.81 a 20 0 56604 2268 7176 464808 0.2 722164 0 0 0 - `- python3
_________________________________________________________________________________________________________________________________________________________
_________________________________________________________________________________________________________________________________________________________
'''