Consider two DataFrames that store the same feature for the same observations, but over different time periods:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({"obs": ["a", "a", "b", "b"],
                    "year": [1, 2, 1, 2],
                    "val": [3, np.NaN, 3, np.NaN]})
df1
Out:
  obs  val  year
0   a    3     1
1   a  NaN     2
2   b    3     1
3   b  NaN     2
df2 = pd.DataFrame({"obs": ["a", "a", "b", "b"],
                    "val": [np.NaN, 4, np.NaN, 4],
                    "year": [1, 2, 1, 2]})
df2.index = range(5, 9)
df2
Out:
  obs  val  year
5   a  NaN     1
6   a    4     2
7   b  NaN     1
8   b    4     2
Now I want to merge or combine these two DataFrames so that the values end up in a single column, with the NaNs in df1 replaced by the corresponding observation-year values from df2.
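For the small example above, the result I am after would look like this (a sketch of the target output, assuming the row order of df1 is kept):

  obs  val  year
0   a    3     1
1   a    4     2
2   b    3     1
3   b    4     2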
I can do it like this:
merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
merged.loc[~np.isfinite(merged.val_x), 'val_x'] = merged[~np.isfinite(merged.val_x)].val_y
i.e., basically do a regular merge and then manually replace the NaNs in one column with the values from the other.
Is there a better/more concise way to do this? I have the feeling that some flavor of df.combine, df.combine_first, or df.update would do what I want, but they all seem to align on the index, as the sketch below shows.
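To make the alignment problem concrete, here is what a naive combine_first does with these two frames (a quick sketch): it matches on index labels, and since df1 is indexed 0-3 while df2 is indexed 5-8, nothing lines up and the NaNs are never filled:
df1['val'].combine_first(df2['val'])
Out:
0    3.0
1    NaN
2    3.0
3    NaN
5    NaN
6    4.0
7    NaN
8    4.0
Name: val, dtype: float64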
Answer 0 (score: 2)
I am going to assume that your goal is to obtain merged['val_x'], and that you do not really care about the other columns in merged.
Here are some options:
def using_merge(df1, df2):
    merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
    # fill the NaNs in val_x from val_y by hand
    mask = ~np.isfinite(merged.val_x)
    merged.loc[mask, 'val_x'] = merged.loc[mask, 'val_y']
    return merged['val_x']

def using_update(df1, df2):
    merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
    # update() overwrites val_y with the non-NaN values of val_x, in place
    merged['val_y'].update(merged['val_x'])
    return merged['val_y']

def using_set_index(df1, df2):
    # with (obs, year) as the index, combine_first aligns the right rows
    df1 = df1.set_index(['obs', 'year'])
    df2 = df2.set_index(['obs', 'year'])
    return df1['val'].combine_first(df2['val'])
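For the small example frames from the question, all three return the filled values [3, 4, 3, 4] (a quick check; note that using_set_index returns a Series keyed by an (obs, year) MultiIndex rather than by position):
using_merge(df1, df2)
Out:
0    3.0
1    4.0
2    3.0
3    4.0
Name: val_x, dtype: float64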
None of them is much more concise than the others, but there is a slight performance difference:
import numpy as np
import pandas as pd
import itertools as IT
# generate a large-ish example
np.random.seed(2015)
N, M = 200, 200
df1 = pd.DataFrame(list(IT.product(np.arange(N), np.arange(M))),
                   columns=['obs', 'year'])
df1['val'] = np.random.choice([1, 2, np.nan], size=len(df1))
df2 = pd.DataFrame(list(IT.product(np.arange(N), np.arange(M))),
                   columns=['obs', 'year'])
df2['val'] = np.random.choice([1, 2, np.nan], size=len(df2))
df2.index = np.arange(len(df2)) + len(df1)
m1 = using_merge(df1, df2)
m2 = using_update(df1, df2)
m3 = using_set_index(df1, df2)
assert m3.reset_index(drop=True).equals(m1)
assert m1.equals(m2)
In [158]: %timeit using_merge(df1, df2)
100 loops, best of 3: 13.6 ms per loop
In [159]: %timeit using_update(df1, df2)
100 loops, best of 3: 12.3 ms per loop
In [160]: %timeit using_set_index(df1, df2)
100 loops, best of 3: 8 ms per loop
So for larger DataFrames, it is worth setting the index and then using combine_first.
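Inlined for the original frames, that recommendation reads (a sketch; reset_index just turns the (obs, year) MultiIndex back into ordinary columns):
result = (df1.set_index(['obs', 'year'])['val']
             .combine_first(df2.set_index(['obs', 'year'])['val'])
             .reset_index())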