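I need to compute a custom sum of one column, df['P'], for each row of a pandas dataframe. I am currently doing this with a for loop, which I realize is very inefficient, but it lets me lay out the structure of the computation. I am trying to come up with a more Pythonic / pandas-style implementation to cut the run time. I used the solution from this post: pandas: rapidly calculating sum of column with certain values to speed things up, but it still runs very slowly.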
def weight_sum(inc_grp, taz, chosen, probs, hh_id, row_inc_grp, row_taz, row_hh_id):
    # First sum: chosen rows in the same income group whose taz is a key of w[row_taz].
    # Second sum: chosen rows in the same income group that belong to another household.
    return beta_dict['RHO'] * (sum(p for i, j, k, p in zip(inc_grp, taz, chosen, probs)
                                   if i == row_inc_grp and j in w[row_taz] and k == 1)
                               + sum(p for i, j, k, p in zip(inc_grp, hh_id, chosen, probs)
                                     if i == row_inc_grp and j != row_hh_id and k == 1))

inc_grp = df['income_grp'].values
taz = df['taz'].values
chosen = df['chosen'].values
hh_id = df['hh_id'].values
probs = df['P'].values

for row in df.itertuples():
    df.loc[row[0], 'V_comb'] = row.V_comb + weight_sum(inc_grp, taz, chosen, probs,
                                                       hh_id, row.income_grp, row.taz, row.hh_id)
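Basically, for each target row the code does the following:
1. It sums df['P'] over the rows where df['income_grp'] equals the target row's value, df['chosen'] equals 1, and df['taz'] matches one of the keys of the entry in w that corresponds to the target row's df['taz'] value. That entry is the list of zones associated with the target row's df['taz'] over which I want to sum.
2. It does a similar subsetting over the other households (rows identified by a different df['hh_id']).
I'm sure there is a way to do this, but it keeps eluding me. There are about 28,000 rows in the dataframe, and this part of the code is the main consumer of run time. Is there a way to apply this operation to the entire dataframe column at once? I think groupby().sum() might be useful.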
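As a rough illustration of how groupby() could cover the second sum (same income group, chosen == 1, different household), here is a minimal sketch. It assumes the column names from the question and uses made-up toy values. The idea is to take the chosen probability mass of the whole income group and subtract the part contributed by the row's own household.

```python
import pandas as pd

# Toy data: column names follow the question, the values are made up.
df = pd.DataFrame({
    "hh_id":      [11, 11, 12, 13, 14],
    "income_grp": [2, 2, 2, 2, 3],
    "chosen":     [1, 1, 1, 0, 1],
    "P":          [0.10, 0.20, 0.30, 0.40, 0.50],
})

# Probability mass that counts toward the sums: P where chosen == 1, else 0.
chosen_p = df["P"].where(df["chosen"] == 1, 0.0)

# Total chosen mass per income group, broadcast back to every row.
grp_total = chosen_p.groupby(df["income_grp"]).transform("sum")

# Chosen mass contributed by the row's own household within its income group.
own_hh = chosen_p.groupby([df["income_grp"], df["hh_id"]]).transform("sum")

# Sum over rows with the same income group, chosen == 1 and a different hh_id.
other_hh_sum = grp_total - own_hh
```

Because both groupby calls touch each row only once, this avoids rescanning the whole column for every row, which is what makes the loop version slow.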
Here is a subset of the dataframe:
hh_mem_id hh_id memb_id taz_struc taz income_grp chosen V_comb P
0 11 11 0 4028.2 4028 2 1 2.0289830623 0.1420552675
1 2002 2002 0 4028.2 4028 3 0 0.1571991902 0.0109275283
2 3775.1 3775 1 4028.2 4028 3 0 1.5821643888 0.045433528
3 1099.2 1099 2 4028.2 4028 3 0 0.3537670241 0.0133011829
4 3249.1 3249 1 4028.2 4028 3 0 0.6103028388 0.017191048
5 2903 2903 0 4028.2 4028 3 0 0.3912196062 0.0276175857
6 3671 3671 0 4028.2 4028 4 0 1.1843450617 0.0203476596
7 133 133 0 4028.2 4028 3 0 0.4345199881 0.014419853
8 1563.2 1563 2 4028.2 4028 5 0 0.0036775258 0.0062482309
9 142 142 0 4028.2 4028 4 0 0.7255248979 0.0192904633
10 5097 5097 0 4028.2 4028 3 0 0.0811923744 0.0202554826
11 3489.2 3489 2 4028.2 4028 4 0 -0.2867591139 0.0046732825
12 2432.1 2432 1 4028.2 4028 2 0 0.0827980747 0.0101440165
13 4296 4296 0 4028.2 4028 3 0 0.5167749373 0.0156561042
14 5377 5377 0 4028.2 4028 2 0 -1.0837694081 0.0063183855
15 3546 3546 0 4028.2 4028 1 0 -1.1511959076 0.0059064042
16 3084 3084 0 4028.2 4028 2 0 -0.6162896774 0.0100839339
17 3506.1 3506 1 4028.2 4028 5 0 0.8353570673 0.0143532716
18 798.1 798 0 4028.2 4028 3 0 1.1557859384 0.0593243037
19 4067 4067 0 4028.2 4028 5 0 0.7786698771 0.013562257
20 786.2 786 2 4028.2 4028 5 0 0.1487080264 0.0054175668
21 4155 4155 0 4028.2 4028 5 0 0.2379145637 0.0118461215
22 3036.1 3036 1 4028.2 4028 5 0 0.9867959382 0.0125251009
23 4223.1 4223 1 4028.2 4028 5 0 0.7162872899 0.0127420574
24 3510 3510 0 4028.2 4028 2 0 -0.4016915094 0.0124976624
25 1736.1 1736 0 4028.2 4028 3 0 1.3770839318 0.0370093239
26 2336.1 2336 1 4028.2 4028 3 0 0.626406915 0.0174701352
27 2367.1 2367 1 4028.2 4028 5 0 0.2879033723 0.0124533457
28 4150.2 4150 2 4028.2 4028 5 0 -0.2505594914 0.0048455529
29 4270 4270 0 4028.2 4028 5 0 0.5620574806 0.0109208993
30 2002.1 2002 1 4028.2 4028 3 0 -0.694312505 0.0046635336
31 3775 3775 0 4028.2 4028 3 0 -0.251272972 0.0072631453
32 1099.1 1099 0 4028.2 4028 3 0 0.7689167591 0.0201459385
33 3249 3249 0 4028.2 4028 3 0 0.0015696848 0.0093526117
34 3671.2 3671 2 4028.2 4028 4 0 -0.0300530998 0.006040989
35 3671.1 3671 1 4028.2 4028 4 0 0.7186898628 0.0127727079
36 133.1 133 1 4028.2 4028 3 0 0.1183203344 0.0105108313
37 1563 1563 0 4028.2 4028 5 0 0.7554359922 0.0132507855
38 1563.3 1563 3 4028.2 4028 5 0 0.856618101 0.0146617042
39 142.1 142 1 4028.2 4028 4 0 -0.5234586083 0.0055324311
40 3489.1 3489 1 4028.2 4028 4 0 0.5136023055 0.0104043412
41 3489 3489 0 4028.2 4028 4 0 1.0174426754 0.0172198625
42 2432 2432 0 4028.2 4028 2 0 0.2873825304 0.0124468612
43 4296.1 4296 1 4028.2 4028 3 0 0.0794730632 0.0101103435
44 3506.2 3506 2 4028.2 4028 5 0 0.0184839582 0.0063414332
45 3506 3506 0 4028.2 4028 5 0 0.2625970387 0.0080947676
46 4067.2 4067 2 4028.2 4028 5 0 0.6172063558 0.0115400915
47 4067.1 4067 1 4028.2 4028 5 0 0.6173185103 0.0115413859
48 786.3 786 3 4028.2 4028 5 0 0.1487080264 0.0054175668
49 786.1 786 1 4028.2 4028 5 0 0.6050092935 0.0085501434
50 786 786 0 4028.2 4028 5 0 0.7613981637 0.0099975187
51 4155.1 4155 1 4028.2 4028 5 0 0.6072911746 0.0171393523
52 3036.2 3036 2 4028.2 4028 5 0 0.7048105533 0.0094474921
53 3036 3036 0 4028.2 4028 5 0 0.627374922 0.0087435273
54 3036.5 3036 5 4028.2 4028 5 0 0.5908809189 0.0084301932
55 4223 4223 0 4028.2 4028 5 0 0.9146967449 0.0155384498
56 4223.3 4223 3 4028.2 4028 5 0 0.9352868379 0.0158617044
57 1736.3 1736 3 4028.2 4028 3 0 0.4855928507 0.0151754471
58 2336 2336 0 4028.2 4028 3 0 0.5800003478 0.0166779301
59 2367 2367 0 4028.2 4028 5 0 0.5503894858 0.0161913222
60 4150 4150 0 4028.2 4028 5 0 0.2127295435 0.0077010015
61 4150.1 4150 1 4028.2 4028 5 0 0.4936026393 0.0101983249
62 4270.2 4270 2 4028.2 4028 5 0 0.9579755018 0.0162256989
63 4270.1 4270 1 4028.2 4028 5 0 0.6540339302 0.0119730078
64 12 12 0 3649.1 3649 5 1 0.7922317695 0.0119365752
65 1922 1922 0 3649.1 3649 2 0 -0.4376740892 0.0069786016
66 5434 5434 0 3649.1 3649 2 0 1.5455019765 0.0507050046
67 3427 3427 0 3649.1 3649 3 0 1.0252726867 0.030138256
68 1710 1710 0 3649.1 3649 3 0 1.4636873348 0.0467217584
69 215 215 0 3649.1 3649 4 0 0.8383515125 0.0083333194
70 3872.1 3872 1 3649.1 3649 5 0 0.5878580212 0.0097301906
71 4184 4184 0 3649.1 3649 3 0 1.6013392113 0.0536167678
72 2305 2305 0 3649.1 3649 2 0 0.914665738 0.0134912482
73 3928 3928 0 3649.1 3649 3 0 1.6743119993 0.0576756249
74 3653 3653 0 3649.1 3649 3 0 1.1358984857 0.0336637343
75 138 138 0 3649.1 3649 3 0 1.7493749526 0.0310857779
76 458 458 0 3649.1 3649 3 0 1.4085683914 0.0442161909
77 1469 1469 0 3649.1 3649 3 0 1.2873661026 0.0391691224
78 5625.2 5625 2 3649.1 3649 5 0 0.2433721144 0.0045964417
79 2606.1 2606 1 3649.1 3649 5 0 0.5828831254 0.0096819041
80 3931.1 3931 1 3649.1 3649 4 0 0.9396346763 0.0069161756
81 4131.2 4131 2 3649.1 3649 5 0 0.5232201888 0.0045605739
82 4302.1 4302 1 3649.1 3649 3 0 0.893931835 0.013214402
83 1754 1754 0 3649.1 3649 2 0 -0.3000669052 0.0080081177
84 2936.1 2936 0 3649.1 3649 3 0 0.6754471945 0.0212417765
85 2737.2 2737 2 3649.1 3649 3 0 -0.5740444845 0.0030444826
86 4040 4040 0 3649.1 3649 3 0 1.0270476272 0.0150958985
87 3007 3007 0 3649.1 3649 5 0 0.8287041974 0.0082533118
88 4198 4198 0 3649.1 3649 2 0 1.7898540629 0.0647398352
89 4886 4886 0 3649.1 3649 5 0 1.0735474149 0.010542954
90 2898 2898 0 3649.1 3649 2 0 1.4747234015 0.0472402386
91 507 507 0 3649.1 3649 3 0 1.0621690726 0.0312710176
92 3320 3320 0 3649.1 3649 2 0 1.8349981668 0.0677294306
93 1725.2 1725 2 3649.1 3649 3 0 0.7758190633 0.0117422626
94 215.2 215 2 3649.1 3649 4 0 0.2386153377 0.0045746294
95 215.1 215 1 3649.1 3649 4 0 1.499844627 0.0161473343
96 3872 3872 0 3649.1 3649 5 0 0.9871911231 0.0145060613
97 2305.2 2305 2 3649.1 3649 2 0 0.7395638436 0.0113241691
98 138.1 138 1 3649.1 3649 3 0 0.9743617728 0.0143211467
99 5625 5625 0 3649.1 3649 5 0 0.5903762734 0.0065031497
100 5625.1 5625 1 3649.1 3649 5 0 0.9824527912 0.0096249929
101 2606 2606 0 3649.1 3649 5 0 1.2693837925 0.0192355331
102 3931.2 3931 2 3649.1 3649 4 0 0.928477973 0.0068394427
103 3931 3931 0 3649.1 3649 4 0 0.855892031 0.0063605847
104 3931.3 3931 3 3649.1 3649 4 0 0.8567504113 0.0063660469
105 4131.3 4131 3 3649.1 3649 5 0 0.7858987531 0.0059306097
106 4131 4131 0 3649.1 3649 5 0 0.4918550313 0.0044197508
107 4131.1 4131 1 3649.1 3649 5 0 1.3324098035 0.010243446
108 4302 4302 0 3649.1 3649 3 0 1.0205806143 0.0149985882
109 2737.1 2737 0 3649.1 3649 3 0 0.7340224027 0.0112615905
110 4040.1 4040 1 3649.1 3649 3 0 0.6811995799 0.0106821598
111 3007.1 3007 1 3649.1 3649 5 0 0.825227624 0.0082246684
112 3007.2 3007 2 3649.1 3649 5 0 0.7815236308 0.007872959
113 4886.1 4886 1 3649.1 3649 5 0 0.7827331819 0.0078824876
114 4886.2 4886 2 3649.1 3649 5 0 0.7767939208 0.0078358102
115 1725.1 1725 0 3649.1 3649 3 0 0.9985947281 0.0146724295
116 12.1 12 1 3649.1 3649 5 1 1.0093720796 0.0148314146
117 40 40 0 3602.2 3602 3 1 1.4149337468 0.0496880853
118 2728 2728 0 3602.2 3602 3 0 0.2540527003 0.0155628105
119 4786.1 4786 0 3602.2 3602 3 0 1.8863507604 0.0796133813
Here is a sample entry of 'w' for df['taz'] == 4028:
{3602: 1.0, 4027: 1.0, 4029: 1.0}
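To show how an entry like this could be used without a per-row scan, here is a hedged sketch of the first sum (same income group, chosen == 1, taz among the keys of w[row_taz]). The full dict w below is hypothetical apart from the 4028 entry, the data values are made up, and neighbour_sum is an illustrative helper name rather than anything from the original code.

```python
import pandas as pd

# Hypothetical neighbour lookup in the same shape as the sample entry:
# w[taz] is a dict whose keys are the neighbouring zones.
w = {
    4028: {3602: 1.0, 4027: 1.0, 4029: 1.0},
    3602: {4028: 1.0},
    4027: {4028: 1.0},
}

# Toy data: column names follow the question, the values are made up.
df = pd.DataFrame({
    "taz":        [4028, 3602, 4027, 3602, 4028],
    "income_grp": [2, 2, 2, 3, 2],
    "chosen":     [1, 1, 1, 1, 0],
    "P":          [0.10, 0.20, 0.30, 0.40, 0.50],
})

# Chosen probability mass per (income_grp, taz) cell.
cell_mass = (
    df.loc[df["chosen"] == 1]
      .groupby(["income_grp", "taz"])["P"]
      .sum()
      .to_dict()
)

def neighbour_sum(inc, taz):
    """Sum the chosen mass of one income group over the neighbours of taz."""
    return sum(cell_mass.get((inc, nb), 0.0) for nb in w.get(taz, {}))

# One lookup per distinct (income_grp, taz) pair instead of one scan per row.
pairs = df[["income_grp", "taz"]].drop_duplicates()
pairs = pairs.assign(
    nbr_sum=[neighbour_sum(i, t) for i, t in zip(pairs["income_grp"], pairs["taz"])]
)
df = df.merge(pairs, on=["income_grp", "taz"], how="left")
```

The number of distinct (income_grp, taz) pairs is usually far smaller than the number of rows, so the Python-level loop here stays short even for a large dataframe.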
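For row 1, I need to compute df['P'].sum() where df['taz'] == 4028, df['income_grp'] == 2 and df['chosen'] == 1. I also need the sum where df['hh_id'] != 11, df['income_grp'] == 2 and df['chosen'] == 1. The result should be added to the column df['V_comb']. I need to do this for every row of the dataframe, and the code runs many times because it is part of an optimization algorithm.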
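For concreteness, the two sums described for a single row can be written with boolean masks as below. This is only a sketch of the prose description: it uses five rows copied from the sample data, it matches df['taz'] by equality as stated in the text (the loop version tests membership in w instead), and row_sums is an illustrative helper name.

```python
import pandas as pd

# A handful of rows copied from the sample data (original index kept).
df = pd.DataFrame(
    {
        "hh_id":      [11, 2432, 12, 12, 40],
        "taz":        [4028, 4028, 3649, 3649, 3602],
        "income_grp": [2, 2, 5, 5, 3],
        "chosen":     [1, 0, 1, 1, 1],
        "P":          [0.1420552675, 0.0101440165, 0.0119365752,
                       0.0148314146, 0.0496880853],
    },
    index=[0, 12, 64, 116, 117],
)

def row_sums(df, idx):
    """Return (zone_sum, other_hh_sum) for the row labelled idx."""
    row = df.loc[idx]
    same_grp_chosen = (df["income_grp"] == row["income_grp"]) & (df["chosen"] == 1)
    zone_sum = df.loc[same_grp_chosen & (df["taz"] == row["taz"]), "P"].sum()
    other_hh_sum = df.loc[same_grp_chosen & (df["hh_id"] != row["hh_id"]), "P"].sum()
    return zone_sum, other_hh_sum

zone_sum, other_hh_sum = row_sums(df, 0)
```

Calling this once per row reproduces the cost problem of the loop, so it is mainly useful as a reference to check a vectorized version against.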
Answer 0 (score: 0)
Based on your edited post, this should do what you want:
df['V_comb'] = df[(df['income_grp']==2) & (df['taz']==4028) & (df['chosen']==1)][['P','V_comb']].sum(axis=1)
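Answer 1 (score: 0)
I was able to improve the run time considerably through a combination of changes. First, there is no reason to filter the dataframe on every optimization iteration. I now do the filtering once, in a for loop at the start of the program, which I moved into a function and optimized with cython. The result is a numpy array of 0/1 values that records, for each pair of rows, whether the conditions hold. I can then take the dot product of that matrix with the dataframe's probability column to get the sums of probabilities in vectorized form. According to my profiling, most of the time is now spent in the optimization itself (which could easily be improved by updating the initial parameter values to the output of the previous run). Code snippet: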
import numpy as np
cimport numpy as np

def get_filt_mat(long[:, :] X, double[:, :] Y, M):
    # X columns: 0 = income_grp, 1 = taz, 2 = hh_id, 3 = chosen
    # Y: N x N output array of 0/1 flags; M: sparse matrix in CSR form
    cdef int N = X.shape[0]
    cdef int[:] indices, indptr
    cdef int i, j
    indices = M.indices.astype(np.int32)
    indptr = M.indptr.astype(np.int32)
    cdef int I = indptr.shape[0]
    for i in range(N):
        for j in range(N):
            if X[i,0] == X[j,0] and X[j,3] == 1:
                if N<=I:
                    if indptr[i]==X[i,2] and indices[j]==X[j,2]:
                        Y[i,j] = 1
                if X[i,1] == X[j,1] and X[j,2] != X[i,2]:
                    Y[i,j] = 1
    return Y
Function call:
N = df.shape[0]
filtArray = np.zeros((N,N))
inArray = df[['income_grp', 'taz', 'hh_id', 'chosen']].values
outArray = get_filt_mat(inArray, filtArray, ws)
outArray = outArray.base
Applying it to the dataframe column:
vectProb = df['P'].values
df['P_w'] = outArray.dot(vectProb) * beta_dict['RHO']
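The same idea (build a 0/1 matrix once, then take a dot product with the probabilities) can also be sketched in plain NumPy with broadcasting. The example below is an illustration on made-up toy values, not the Cython routine above, and it covers only the household condition. Note that for 28,000 rows a boolean N x N matrix takes about 784 MB, so memory is the main constraint of this approach.

```python
import numpy as np

# Toy columns: names follow the question, the values are made up.
income_grp = np.array([2, 2, 2, 2, 3])
hh_id = np.array([11, 11, 12, 13, 14])
chosen = np.array([1, 1, 1, 0, 1])
P = np.array([0.10, 0.20, 0.30, 0.40, 0.50])

# filt[i, j] is True when row j counts toward row i's sum:
# same income group, row j chosen, and a different household.
same_grp = income_grp[:, None] == income_grp[None, :]
other_hh = hh_id[:, None] != hh_id[None, :]
filt = same_grp & other_hh & (chosen[None, :] == 1)

# The filter depends only on fixed columns, so it can be built once
# and reused every time P changes during the optimization.
sums = filt.astype(float) @ P
```

Only the last line has to be rerun when the probabilities are updated, which is where the saving over re-filtering on every iteration comes from.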
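This is my first time using cython, so this is probably not perfect code, but the run now takes about 10 minutes, whereas the original algorithm in pure python and pandas ran for 14 hours without finishing. I found these resources useful (especially for getting cython to work with sparse matrices):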