Question

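I have two DataFrames (with a DatetimeIndex) and want to update the first (older) frame with data from the second (newer) frame.

The new frame may contain more recent data for rows already present in the old frame. In that case, the data in the old frame should be overwritten with the data from the new frame. The newer frame may also have more columns or rows than the first one. In that case, the old frame should be enlarged with the data from the new frame.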

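The pandas docs state,

"The .loc/.ix/[] operations can perform enlargement when setting a non-existent key for that axis"

and

"a DataFrame can be enlarged on either axis via .loc"

However, this doesn't seem to work and throws a KeyError. Example: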

In [195]: df1
Out[195]: 
                     A  B  C
2015-07-09 12:00:00  1  1  1
2015-07-09 13:00:00  1  1  1
2015-07-09 14:00:00  1  1  1
2015-07-09 15:00:00  1  1  1

In [196]: df2
Out[196]: 
                     A  B  C  D
2015-07-09 14:00:00  2  2  2  2
2015-07-09 15:00:00  2  2  2  2
2015-07-09 16:00:00  2  2  2  2
2015-07-09 17:00:00  2  2  2  2

In [197]: df1.loc[df2.index] = df2
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-197-74e630e87cf8> in <module>()
----> 1 df1.loc[df2.index] = df2

/.../pandas/core/indexing.pyc in __setitem__(self, key, value)
    112 
    113     def __setitem__(self, key, value):
--> 114         indexer = self._get_setitem_indexer(key)
    115         self._setitem_with_indexer(indexer, value)
    116 

/.../pandas/core/indexing.pyc in _get_setitem_indexer(self, key)
    107 
    108         try:
--> 109             return self._convert_to_indexer(key, is_setter=True)
    110         except TypeError:
    111             raise IndexingError(key)

/.../pandas/core/indexing.pyc in _convert_to_indexer(self, obj, axis, is_setter)
   1110                 mask = check == -1
   1111                 if mask.any():
-> 1112                     raise KeyError('%s not in index' % objarr[mask])
   1113 
   1114                 return _values_from_object(indexer)

KeyError: "['2015-07-09T18:00:00.000000000+0200' '2015-07-09T19:00:00.000000000+0200'] not in index"

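What would be the best way (with regard to performance, as my real data is much larger) to achieve the desired update and enlargement of the two DataFrames? This is the result I would like to see: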

                     A  B  C    D
2015-07-09 12:00:00  1  1  1  NaN
2015-07-09 13:00:00  1  1  1  NaN
2015-07-09 14:00:00  2  2  2    2
2015-07-09 15:00:00  2  2  2    2
2015-07-09 16:00:00  2  2  2    2
2015-07-09 17:00:00  2  2  2    2

Answer 1

df2.combine_first(df1) (documentation) seems to meet your requirement; please find below the code snippet & output

import pandas as pd

print('pandas-version: ', pd.__version__)

df1 = pd.DataFrame.from_records([('2015-07-09 12:00:00',1,1,1),
                                 ('2015-07-09 13:00:00',1,1,1),
                                 ('2015-07-09 14:00:00',1,1,1),
                                 ('2015-07-09 15:00:00',1,1,1)],
                                columns=['Dt', 'A', 'B', 'C']).set_index('Dt')
# print(df1)

df2 = pd.DataFrame.from_records([('2015-07-09 14:00:00',2,2,2,2),
                                 ('2015-07-09 15:00:00',2,2,2,2),
                                 ('2015-07-09 16:00:00',2,2,2,2),
                                 ('2015-07-09 17:00:00',2,2,2,2),],
                               columns=['Dt', 'A', 'B', 'C', 'D']).set_index('Dt')
res_combine1st = df2.combine_first(df1)
print(res_combine1st)

Output

pandas-version:  0.15.2
                     A  B  C   D
Dt                              
2015-07-09 12:00:00  1  1  1 NaN
2015-07-09 13:00:00  1  1  1 NaN
2015-07-09 14:00:00  2  2  2   2
2015-07-09 15:00:00  2  2  2   2
2015-07-09 16:00:00  2  2  2   2
2015-07-09 17:00:00  2  2  2   2
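One point worth keeping in mind with this approach: combine_first only takes a value from the other frame where the calling frame has a null, so a NaN in the newer frame does not overwrite an existing value in the older one. The sketch below illustrates this under a few assumptions: the names `old` and `new`, the float values, the `"60min"` frequency, and the artificial NaN are all made up for the illustration and are not part of the original data.

```python
import numpy as np
import pandas as pd

# Older frame: 12:00-15:00, columns A-C (illustrative values).
old = pd.DataFrame(
    1.0,
    index=pd.date_range("2015-07-09 12:00", periods=4, freq="60min"),
    columns=list("ABC"),
)
# Newer frame: 14:00-17:00, columns A-D.
new = pd.DataFrame(
    2.0,
    index=pd.date_range("2015-07-09 14:00", periods=4, freq="60min"),
    columns=list("ABCD"),
)
# Hypothetical gap in the newer frame, added only to show the fallback.
new.loc[pd.Timestamp("2015-07-09 14:00"), "A"] = np.nan

# Values from `new` win; nulls in `new` fall back to `old`.
merged = new.combine_first(old)

print(merged)
```

If a NaN in the newer frame is supposed to replace the older value, combine_first alone will not do that, and the overlap would need to be assigned explicitly after reindexing.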

Answer 2

You can use the combine function.

import numpy as np
import pandas as pd

# your data
# ===========================================================
df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))

df2 = pd.DataFrame(np.ones(16).reshape(4,4)*2, columns='A B C D'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))

# processing
# =====================================================
# reindex to populate NaN
result = df2.reindex(np.union1d(df1.index, df2.index))

Out[248]: 
                      A   B   C   D
2015-07-09 12:00:00 NaN NaN NaN NaN
2015-07-09 13:00:00 NaN NaN NaN NaN
2015-07-09 14:00:00   2   2   2   2
2015-07-09 15:00:00   2   2   2   2
2015-07-09 16:00:00   2   2   2   2
2015-07-09 17:00:00   2   2   2   2

combiner = lambda x, y: np.where(x.isnull(), y, x)

# use df1 to update result
result.combine(df1, combiner)

Out[249]: 
                     A  B  C   D
2015-07-09 12:00:00  1  1  1 NaN
2015-07-09 13:00:00  1  1  1 NaN
2015-07-09 14:00:00  2  2  2   2
2015-07-09 15:00:00  2  2  2   2
2015-07-09 16:00:00  2  2  2   2
2015-07-09 17:00:00  2  2  2   2

# maybe fillna(method='ffill') if you like
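Regarding the optional fill mentioned in the last comment: in this particular data the remaining NaNs sit at the start of column D, so a forward fill has nothing earlier to copy from and leaves them in place, while a backward fill would populate them. A minimal sketch, assuming illustrative names (`old`, `new`), float values, a `"60min"` frequency, and a combiner that returns a Series rather than a NumPy array:

```python
import pandas as pd

old = pd.DataFrame(
    1.0,
    index=pd.date_range("2015-07-09 12:00", periods=4, freq="60min"),
    columns=list("ABC"),
)
new = pd.DataFrame(
    2.0,
    index=pd.date_range("2015-07-09 14:00", periods=4, freq="60min"),
    columns=list("ABCD"),
)

# Same two steps as above: reindex the newer frame on the union of both
# indexes, then fill its gaps column by column from the older frame.
result = new.reindex(old.index.union(new.index))
result = result.combine(old, lambda x, y: x.where(x.notna(), y))

forward = result.ffill()   # leading NaNs in D have no earlier value to copy
backward = result.bfill()  # leading NaNs in D take the next valid value

print(forward["D"].isna().sum())
print(backward["D"].isna().sum())
```

Whether either fill is appropriate depends on what a missing D value means for the older rows; leaving the NaNs in place matches the result requested in the question.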

Answer 3

In addition to the previous answer, after reindexing you can also use

result.fillna(df1, inplace=True)

So, based on Jianxun Li's code (extended by one more column), you can try this

# your data
# ===========================================================
df1 = pd.DataFrame(np.ones(12).reshape(4,3), columns='A B C'.split(), index=pd.date_range('2015-07-09 12:00:00', periods=4, freq='H'))
df2 = pd.DataFrame(np.ones(20).reshape(4,5)*2, columns='A B C D E'.split(), index=pd.date_range('2015-07-09 14:00:00', periods=4, freq='H'))

# processing
# =====================================================
# reindex to populate NaN
result = df2.reindex(np.union1d(df1.index, df2.index))
# fill NaN from df1
result.fillna(df1, inplace=True)

Out[3]:             
                     A  B  C   D   E
2015-07-09 12:00:00  1  1  1 NaN NaN
2015-07-09 13:00:00  1  1  1 NaN NaN
2015-07-09 14:00:00  2  2  2   2   2
2015-07-09 15:00:00  2  2  2   2   2
2015-07-09 16:00:00  2  2  2   2   2
2015-07-09 17:00:00  2  2  2   2   2
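A caveat with the reindex-then-fillna approach: fillna only fills holes in columns the frame already has, so a column that exists only in the older frame would be lost when the newer frame is reindexed on rows alone. Reindexing on both axes avoids that. The sketch below assumes a hypothetical extra column `Z` in the older frame, plus illustrative names (`old`, `new`) and a `"60min"` frequency that are not part of the original example.

```python
import pandas as pd

# Older frame with a hypothetical column Z that the newer frame lacks.
old = pd.DataFrame(
    1.0,
    index=pd.date_range("2015-07-09 12:00", periods=4, freq="60min"),
    columns=list("ABCZ"),
)
new = pd.DataFrame(
    2.0,
    index=pd.date_range("2015-07-09 14:00", periods=4, freq="60min"),
    columns=list("ABCD"),
)

rows = old.index.union(new.index)
cols = old.columns.union(new.columns)

# Reindexing on rows only: Z never appears in the result.
rows_only = new.reindex(rows).fillna(old)

# Reindexing on both axes: Z is kept and filled from the older frame.
both_axes = new.reindex(index=rows, columns=cols).fillna(old)

print(list(rows_only.columns))
print(both_axes)
```

When the newer frame always contains every column of the older one, as in the question, the row-only reindex is sufficient.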
