I have a dataframe df1:
key1  key2  val
1     100
2     500
4     400
I also have a multi-indexed dataframe df2:
         c
a   b
1   100  a
2   200  b
3   300  j
4   400  e
5   500  t
I want to fill the val column of df1 from the multi-indexed dataframe df2.
I tried:
for index, row in df1.iterrows():
    try:
        # look up the value of c in df2 by the (key1, key2) MultiIndex
        data = df2.loc[(row['key1'], row['key2']), 'c']
        df1.loc[(df1.key1 == row['key1']) & (df1.key2 == row['key2']), 'val'] = data
    except KeyError:
        pass
In the end, my df1 should look like this:
key1  key2  val
1     100   a
2     500
4     400   e
But my main concern is that the real df2 (the multi-indexed df) is around 60,000-70,000 rows long,
while df1 is barely 10 rows. (I want to repeat this process with df1 holding other data each time.)
Does .loc in a for loop work for this, and is it the fastest way? Or would .apply be faster?
I want this lookup to run as fast as possible.

Answer 0 (score: 2)
In pandas it is best to avoid loops: iterrows and apply are loops under the hood, so a vectorized solution is better.
Use join with the on parameter:
# to improve performance, sort the index and the key columns
df2 = df2.sort_index()
df1 = df1.sort_values(['key1','key2'])
df = df1.join(df2, on=['key1','key2'])
print (df)
key1 key2 val c
0 1 100 NaN a
1 2 500 NaN NaN
2 4 400 NaN e
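If the goal is to write the matched values straight into val rather than keep a separate c column, the joined column can be assigned back. A minimal sketch, relying on join preserving df1's index so the assignment aligns row by row:
# join keeps df1's index, so 'c' aligns with df1 row for row
df1['val'] = df1.join(df2, on=['key1','key2'])['c']
print (df1)
   key1  key2  val
0     1   100    a
1     2   500  NaN
2     4   400    e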
EDIT:
Another approach is to join the MultiIndex levels and the key columns into strings, and then use map:
df2.index = ['{}_{}'.format(a,b) for a, b in df2.index]
print (df2)
c
1_100 a
2_200 b
3_300 j
4_400 e
5_500 t
df1['joined'] = df1['key1'].astype(str) + '_' + df1['key2'].astype(str)
print (df1)
key1 key2 val joined
0 1 100 NaN 1_100
1 2 500 NaN 2_500
2 4 400 NaN 4_400
df1['col'] = df1['joined'].map(df2['c'])
print (df1)
key1 key2 val joined col
0 1 100 NaN 1_100 a
1 2 500 NaN 2_500 NaN
2 4 400 NaN 4_400 e
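If df2 keeps its original MultiIndex (i.e. without the string flattening shown above), the same lookup can be done by reindexing df2['c'] with df1's key pairs. A sketch, assuming df2 is still indexed by (a, b):
# build a MultiIndex from df1's key columns and look each pair up in df2['c'];
# pairs missing from df2, such as (2, 500), come back as NaN
idx = pd.MultiIndex.from_arrays([df1['key1'], df1['key2']])
df1['val'] = df2['c'].reindex(idx).values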
Timing:
np.random.seed(123)
N = 100000
df2 = pd.DataFrame(np.random.randint(10000, size=(N, 3)), columns=list('abc'))
df2 = df2.drop_duplicates(['a','b']).set_index(['a','b'])
print (df2.head())
c
a b
3582 1346 5218
7763 9785 7382
5857 96 6257
6782 4143 4169
5664 942 6368
df1 = df2.iloc[np.random.randint(N, size=10)].reset_index()
df1.columns = ['key1','key2','val']
print (df1)
key1 key2 val
0 5157 9207 283
1 6452 6474 7092
2 1264 5009 5123
3 86 7225 1025
4 7787 5134 637
5 9406 6119 8719
6 7479 1493 1525
7 4098 7248 7618
8 9921 7925 8547
9 2320 764 1564
1. join on the unsorted MultiIndex and columns:
In [42]: %timeit df1.join(df2, on=['key1','key2'])
100 loops, best of 3: 11.1 ms per loop
2. sort first, then join (the sorting itself is not included in the timing):
df2 = df2.sort_index()
In [44]: %timeit df1.join(df2, on=['key1','key2'])
100 loops, best of 3: 10.5 ms per loop
3. the map solution, with the flattening of the MultiIndex into strings excluded from the timing, since it only has to run once for the same df2 data:
df2.index = ['{}_{}'.format(a,b) for a, b in df2.index]
df1['joined'] = df1['key1'].astype(str) + '_' + df1['key2'].astype(str)
In [51]: %timeit df1['col'] = df1['joined'].map(df2['c'])
1000 loops, best of 3: 371 µs per loop
In [55]: %%timeit
...: df1['joined'] = df1['key1'].astype(str) + '_' + df1['key2'].astype(str)
...: df1['col'] = df1['joined'].map(df2['c'])
...:
1000 loops, best of 3: 1.08 ms per loop
Answer 1 (score: 0)

I think this is not a fill value problem but rather a filter index problem. Try the following:
df1 = df1.set_index(["key1", "key2"])
result = df2.loc[df1.index, :].reset_index()
If you want to keep the original column names:
result.columns = ["key1", "key2", "val"]
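Note that df2.loc[df1.index, :] raises a KeyError (in older pandas versions it warns) for key pairs missing from df2, such as (2, 500) in the sample data. A more tolerant variant, sketched here, uses reindex, which returns NaN for missing pairs instead of raising:
# reindex yields NaN rows for key pairs absent from df2
result = df2.reindex(df1.index).reset_index()
result.columns = ["key1", "key2", "val"]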