pandas DataFrame update/combine when the indexes are not aligned

Time: 2015-08-05 11:41:42

Tags: python pandas dataframe

Consider two dataframes that store information on the same feature of the same observations, but for different time periods:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({"obs":["a","a","b","b"],
    "year":[1,2,1,2],
    "val":[3, np.nan, 3, np.nan]})

df1

Out:
   obs  val  year
0    a    3     1
1    a  NaN     2
2    b    3     1
3    b  NaN     2

df2 = pd.DataFrame({"obs":["a","a","b","b"],
    "val":[np.nan, 4, np.nan, 4],
    "year":[1,2,1,2]})
df2.index = (range(5,9))

df2

Out:
   obs  val  year
5    a  NaN     1
6    a    4     2
7    b  NaN     1
8    b    4     2

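Now I would like to merge or combine these two dataframes so that the values end up in a single column, with each NaN in df1 replaced by the value from df2 for the corresponding obs/year pair. I can do it like this: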

merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
merged.loc[~np.isfinite(merged.val_x), 'val_x'] = merged[~np.isfinite(merged.val_x)].val_y

That is, I basically do a regular merge and then manually replace the NaNs in one column with the values from the other.

Is there a better/more concise way? I feel that something like df.combine, df.combine_first or df.update would do what I want, but they all seem to align on the index.
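To illustrate the alignment problem, here is a minimal sketch of what happens when combine_first is applied to the two frames directly. It rebuilds df1 and df2 from above so that it runs on its own; the names combined, n_rows and n_missing are only there for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"obs": ["a", "a", "b", "b"],
                    "year": [1, 2, 1, 2],
                    "val": [3, np.nan, 3, np.nan]})
df2 = pd.DataFrame({"obs": ["a", "a", "b", "b"],
                    "year": [1, 2, 1, 2],
                    "val": [np.nan, 4, np.nan, 4]},
                   index=range(5, 9))

# combine_first matches rows by index label, not by (obs, year).
# The labels 0..3 and 5..8 never overlap, so nothing gets filled in:
# the result is the union of both indexes and every NaN survives.
combined = df1.combine_first(df2)
n_rows = len(combined)                         # 8 rows instead of 4
n_missing = int(combined["val"].isna().sum())  # 4 NaNs remain
```

So the rows are stacked rather than combined, which is why the index-based methods do not help as long as the frames keep their default indexes.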

1 answer:

Answer 0 (score: 2)

I will assume that your goal is to obtain merged['val_x'] and that you do not really care about the other columns in merged.

Here are some options:

def using_merge(df1, df2):
    merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
    mask = ~np.isfinite(merged.val_x)
    merged.loc[mask, 'val_x'] = merged.loc[mask, 'val_y']
    return merged['val_x']

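def using_update(df1, df2):
    merged = pd.merge(df1, df2, on=["obs", "year"], how="left")
    # update() works in place and ignores NaNs in its argument, so val_x wins
    # wherever it is present. Updating an explicit copy avoids relying on
    # merged['val_y'] being a view of merged.
    result = merged['val_y'].copy()
    result.update(merged['val_x'])
    return result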

def using_set_index(df1, df2):
    df1 = df1.set_index(['obs','year'])
    df2 = df2.set_index(['obs','year'])
    return df1['val'].combine_first(df2['val'])

None of them is much more concise than the others, but there is a small difference in performance:

import numpy as np
import pandas as pd
import itertools as IT

# generate a large-ish example
np.random.seed(2015)
N, M = 200, 200
df1 = pd.DataFrame(list(IT.product(np.arange(N), np.arange(M))), 
                   columns=['obs','year'])
df1['val'] = np.random.choice([1,2,np.nan], size=len(df1))

df2 = pd.DataFrame(list(IT.product(np.arange(N), np.arange(M))), 
                   columns=['obs','year'])
df2['val'] = np.random.choice([1,2,np.nan], size=len(df1))
df2.index = np.arange(len(df2)) + len(df1)

m1 = using_merge(df1, df2)
m2 = using_update(df1, df2)
m3 = using_set_index(df1, df2)
assert m3.reset_index(drop=True).equals(m1)
assert m1.equals(m2)

In [158]: %timeit using_merge(df1, df2)
100 loops, best of 3: 13.6 ms per loop

In [159]: %timeit using_update(df1, df2)
100 loops, best of 3: 12.3 ms per loop

In [160]: %timeit using_set_index(df1, df2)
100 loops, best of 3: 8 ms per loop

So, for larger DataFrames, it is worth setting the index and then using combine_first.
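As a minimal sketch, this is how the set_index/combine_first approach could be applied to the small example from the question while getting obs and year back as ordinary columns. The names keys and combined are illustrative, and the sketch assumes that each (obs, year) pair occurs at most once in each frame.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"obs": ["a", "a", "b", "b"],
                    "year": [1, 2, 1, 2],
                    "val": [3, np.nan, 3, np.nan]})
df2 = pd.DataFrame({"obs": ["a", "a", "b", "b"],
                    "year": [1, 2, 1, 2],
                    "val": [np.nan, 4, np.nan, 4]},
                   index=range(5, 9))

keys = ["obs", "year"]

# Index both frames by (obs, year) so that combine_first aligns on the
# observation/year pair instead of on the unrelated row labels.
combined = (df1.set_index(keys)["val"]
               .combine_first(df2.set_index(keys)["val"])
               .rename("val")    # keep the column name explicit
               .reset_index())   # turn obs and year back into columns
```

Values from df1 take precedence here; df2 is only consulted where df1 has a NaN.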