I have two DataFrames: df1 (800k rows) and df2 (3 rows). If the value of df1_A lies between df2_A and df2_B, then the value of df2_C should be written to df1_C.
for index1, row1 in df1.iterrows():
    for index2, row2 in df2.iterrows():
        if (row1['df1_A'] >= row2['df2_A']) & (row1['df1_A'] <= row2['df2_B']):
            row1['df1_C'] = row2['df2_C']
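The simplest and most readable approach is two nested for loops, but because that runs about 2.4 million iterations (800k × 3), my program's performance suffers. Is there a better way to accomplish this?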
Answer 0 (score: 0)
OK, so your loop code, written as pseudocode, is:
# Pseudocode: df1.rows stands in for row iteration; A, B, C abbreviate the column names
for row1 in df1.rows:
    for row2 in df2.rows:
        if (row1.A >= row2.A) & (row1.A <= row2.B):
            row1.C = row2.C
Let's flip the loops:
for row2 in df2.rows:
    for row1 in df1.rows:
        if (row1.A >= row2.A) & (row1.A <= row2.B):
            row1.C = row2.C
Now, getting rid of the outer loop is not very important, because it only runs three times. Let's vectorize the inner part:
for row2 in df2.itertuples():
    # .loc avoids chained assignment, which may write to a temporary copy
    df1.loc[(df1.A >= row2.A) & (df1.A <= row2.B), 'C'] = row2.C
Simplified:
for row2 in df2.itertuples():
    # between() is inclusive on both ends by default
    df1.loc[df1.A.between(row2.A, row2.B), 'C'] = row2.C
I hope this is good enough. Please let us know how much faster it is.
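For anyone who wants to try the vectorized loop end to end, here is a minimal sketch using the column names from the question. The sample values and the numeric df2_C codes are made up for illustration, and it assumes rows that match no range should stay NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data using the column names from the question
df1 = pd.DataFrame({'df1_A': [3, 12, 7, 20, 50]})
df2 = pd.DataFrame({
    'df2_A': [1, 10, 20],
    'df2_B': [5, 13, 20],
    'df2_C': [100, 200, 300],
})

# Start with an empty result column; rows matching no range stay NaN
df1['df1_C'] = np.nan

# Three iterations, one per df2 row; each applies a boolean mask to all of df1
for row2 in df2.itertuples(index=False):
    in_range = df1['df1_A'].between(row2.df2_A, row2.df2_B)
    df1.loc[in_range, 'df1_C'] = row2.df2_C

print(df1)
```

The Python-level loop runs only once per df2 row, so the per-row work on df1 is done inside pandas rather than in the interpreter.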
Answer 1 (score: 0)
Let's make use of the fact that df2 contains only three rows!
Consider the following vectorized approach:
Setup:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.random.randint(100, size=(10**6, 1)), columns=['val'])
df2 = pd.DataFrame({'A': {0: 1, 1: 10, 2: 20}, 'B': {0: 5, 1: 13, 2: 20}})
Solution:
qry = ' | '.join(['{0[0]}<=val<={0[1]}'.format(r) for r in df2.values.tolist()])
df1.query(qry)
Timing for a 1,000,000-row DataFrame:
In [34]: df1.shape
Out[34]: (1000000, 1)
In [35]: %timeit df1.query(qry)
10 loops, best of 3: 46.6 ms per loop
The generated query:
In [36]: qry
Out[36]: '1<=val<=5 | 10<=val<=13 | 20<=val<=20'
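Note that df1.query(qry) returns only the rows that fall into one of the ranges; it does not write a value from df2 into df1. If the goal is to assign a value per range, as in the question, one possible variation (not part of the answer above) is numpy.select with one boolean mask per df2 row. The C column and the sample values below are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the C column is an assumption added for illustration
df1 = pd.DataFrame({'val': [3, 12, 7, 20, 50]})
df2 = pd.DataFrame({'A': [1, 10, 20], 'B': [5, 13, 20], 'C': [100, 200, 300]})

# One boolean mask per df2 row (between() is inclusive on both ends)
conditions = [df1['val'].between(a, b) for a, b in zip(df2['A'], df2['B'])]

# np.select takes the choice of the first matching condition for each element;
# elements matching no condition get the default
df1['C'] = np.select(conditions, df2['C'].tolist(), default=np.nan)

print(df1)
```

With only three ranges this builds three masks over df1, so the cost should be in the same ballpark as the query approach, though that is worth timing on real data.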