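Adding missing dates to a dataframe using reindex replaces data

Time: 2016-10-26 11:41:36

Tags: python pandas dataframe difference reindex

I am trying to add missing dates to a dataframe.

I have looked at these posts: reindex, reindex2

When I try to reindex the dataframe: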

print(df)
df = df.reindex(dates, fill_value=0)
print(df)

I get the following output:

_updated_at         Name        hour day date       time      data1     data2
06/06/2016 13:27    game_name   13  6    06/06/2016 evening   0         0
07/06/2016 10:33    game_name   10  7    07/06/2016 morning   145.2788  122.7361
18/10/2016 14:34    game_name   14  18   18/10/2016 evening   0         0
19/10/2016 17:12    game_name   17  19   19/10/2016 evening   0         0
24/10/2016 11:05    game_name   11  24   24/10/2016 morning   313.5954  364.4107
24/10/2016 12:02    game_name   12  24   24/10/2016 evening   0         0
25/10/2016 08:50    game_name   8   25   25/10/2016 morning   362.4682  431.5803
25/10/2016 13:00    game_name   13  25   25/10/2016 evening   0         0


_updated_at Name hour day date  time data1  data2
24/10/2016  0    0    0   0     0    0      0
25/10/2016  0    0    0   0     0    0      0
26/10/2016  0    0    0   0     0    0      0
27/10/2016  0    0    0   0     0    0      0
28/10/2016  0    0    0   0     0    0      0
29/10/2016  0    0    0   0     0    0      0
30/10/2016  0    0    0   0     0    0      0

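I would expect new rows to be added for the missing dates, with every value set to 0, rather than all existing rows being replaced with 0.

EDIT: The overall goal is to be able to compute the difference between the morning and evening values for each day.

EDIT2: Current output: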

print (df.reindex(mux, fill_value=0).groupby(level=0)['data1'].diff(-1).dropna())

dtypes: float64(2)None
2016-06-06  morning       0.00000
2016-06-07  morning     440.99582
2016-06-08  morning       0.00000
2016-06-09  morning       0.00000
2016-06-10  morning       0.00000

print (df.reindex(mux, fill_value=0).groupby(level=0)['data2'].diff(-1).dropna())

Length: 142, dtype: float64
2016-06-06  morning    -220.5481
2016-06-07  morning       0.0000
2016-06-08  morning       0.0000
2016-06-09  morning       0.0000
2016-06-10  morning       0.0000
2016-06-11  morning       0.0000

I would like to see evening.

1 answer:

Answer 0: (score: 1)
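Some background on why the reindex in the question wiped every row: reindex keeps an existing row only where the new label equals an existing index label, and fills everything else with fill_value. The sketch below is a minimal illustration under assumptions, not the asker's actual data: the two-row frame, its string index, and the variable names replaced and kept are made up for the example.

```python
import pandas as pd

# Hypothetical miniature of the question's frame: the index holds strings
# with a time component, so none of them equals a plain daily Timestamp.
df = pd.DataFrame(
    {"data1": [145.2788, 313.5954]},
    index=["07/06/2016 10:33", "24/10/2016 11:05"],
)
dates = pd.date_range("2016-10-24", "2016-10-26")

# No new label matches an existing one, so every row comes from fill_value.
replaced = df.reindex(dates, fill_value=0)
print(replaced)

# Assumption: the strings are day-first. Once the index is a DatetimeIndex
# of whole days, the label for 2016-10-24 matches and its value survives.
df.index = pd.to_datetime(df.index, dayfirst=True).normalize()
kept = df.reindex(dates, fill_value=0)
print(kept)
```

If the labels are already aligned, only the genuinely missing dates receive the fill value.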

You can reindex by a MultiIndex built from dates and time, created with MultiIndex.from_product:

df.date = pd.to_datetime(df.date)
dates = pd.date_range(start=df.date.min(), end=df.date.max())
print (dates)
DatetimeIndex(['2016-06-06', '2016-06-07', '2016-06-08', '2016-06-09',
               '2016-06-10', '2016-06-11', '2016-06-12', '2016-06-13',
               '2016-06-14', '2016-06-15',
               ...
               '2016-10-16', '2016-10-17', '2016-10-18', '2016-10-19',
               '2016-10-20', '2016-10-21', '2016-10-22', '2016-10-23',
               '2016-10-24', '2016-10-25'],
              dtype='datetime64[ns]', length=142, freq='D')

mux = pd.MultiIndex.from_product([dates,['morning','evening']])
#print (mux)

df.set_index(['date','time'], inplace=True)
print (df.reindex(mux, fill_value=0))
                         _updated_at       Name  hour  day   data1   data2
2016-06-06 morning                 0          0     0    0  0.0000  0.0000
           evening  06/06/2016 13:27  game_name    13    6  0.0000  0.0000
2016-06-07 morning                 0          0     0    0  0.0000  0.0000
           evening                 0          0     0    0  0.0000  0.0000
2016-06-08 morning                 0          0     0    0  0.0000  0.0000
           evening                 0          0     0    0  0.0000  0.0000
2016-06-09 morning                 0          0     0    0  0.0000  0.0000
           evening                 0          0     0    0  0.0000  0.0000
2016-06-10 morning                 0          0     0    0  0.0000  0.0000
           evening                 0          0     0    0  0.0000  0.0000
2016-06-11 morning                 0          0     0    0  0.0000  0.0000
           evening                 0          0     0    0  0.0000  0.0000
2016-06-12 morning                 0          0     0    0  0.0000  0.0000
           evening                 0          0     0    0  0.0000  0.0000
2016-06-13 morning                 0          0     0    0  0.0000  0.0000
...
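As a self-contained sketch of the same idea, the block below rebuilds a few rows of the question's table and reindexes them against every (date, time) pair. The sample rows, the names full and mux, and the explicit format string are assumptions for illustration. The dates in the question look day-first (dd/mm/yyyy), and pd.to_datetime without a format may read 07/06/2016 as July 6, so the sketch parses them explicitly.

```python
import pandas as pd

# Hypothetical sample taken from a few rows of the question's table.
df = pd.DataFrame({
    "date": ["06/06/2016", "07/06/2016", "24/10/2016", "24/10/2016"],
    "time": ["evening", "morning", "morning", "evening"],
    "data1": [0.0, 145.2788, 313.5954, 0.0],
    "data2": [0.0, 122.7361, 364.4107, 0.0],
})

# Assumption: dates are dd/mm/yyyy, so give the format explicitly.
df["date"] = pd.to_datetime(df["date"], format="%d/%m/%Y")

# Every calendar day between the first and last date, crossed with both slots.
dates = pd.date_range(df["date"].min(), df["date"].max())
mux = pd.MultiIndex.from_product(
    [dates, ["morning", "evening"]], names=["date", "time"]
)

# Existing (date, time) rows are kept; missing pairs are filled with 0.
full = df.set_index(["date", "time"]).reindex(mux, fill_value=0)
print(full.head(4))
```

Because the index now contains both the date and the time slot, existing rows match their labels exactly and only the missing combinations are filled.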

Finally, you can groupby the first level (the dates) and use DataFrameGroupBy.diff. Each date yields one row with NaN, which can be removed with dropna:

print (df.reindex(mux, fill_value=0).groupby(level=0)[['data1','data2']].diff(-1).dropna())
                     data1   data2
2016-06-06 morning  0.0000  0.0000
2016-06-07 morning  0.0000  0.0000
2016-06-08 morning  0.0000  0.0000
2016-06-09 morning  0.0000  0.0000
2016-06-10 morning  0.0000  0.0000
2016-06-11 morning  0.0000  0.0000
2016-06-12 morning  0.0000  0.0000
2016-06-13 morning  0.0000  0.0000
2016-06-14 morning  0.0000  0.0000
2016-06-15 morning  0.0000  0.0000
2016-06-16 morning  0.0000  0.0000
2016-06-17 morning  0.0000  0.0000
2016-06-18 morning  0.0000  0.0000
2016-06-19 morning  0.0000  0.0000
2016-06-20 morning  0.0000  0.0000
2016-06-21 morning  0.0000  0.0000
...                    ...     ...
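The following sketch shows why only the morning rows remain after dropna, which may explain the output seen in the question's EDIT2. The two-day frame and the names idx, full and delta are made up for illustration; the values are taken from the last two dates in the question.

```python
import pandas as pd

# Hypothetical two-day frame, already indexed by (date, time) pairs,
# with morning listed before evening within each date.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2016-10-24", "2016-10-25"]), ["morning", "evening"]],
    names=["date", "time"],
)
full = pd.DataFrame(
    {"data1": [313.5954, 0.0, 362.4682, 0.0],
     "data2": [364.4107, 0.0, 431.5803, 0.0]},
    index=idx,
)

# diff(-1) subtracts the next row within each date: morning minus evening.
# The evening row has no successor, so it becomes NaN and dropna removes it.
delta = full.groupby(level=0)[["data1", "data2"]].diff(-1).dropna()
print(delta)
```

The difference is therefore labelled with the morning slot, and the evening label never appears in the result.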

# .ix was removed in pandas 1.0; .iloc selects the two rows by position instead
cols = ['data1','data2']
print (df.reindex(mux, fill_value=0)
         .groupby(level=0)
         .apply(lambda x: x[cols].iloc[0] - x[cols].iloc[1]))

               data1     data2
2016-06-06    0.0000    0.0000
2016-06-07    0.0000    0.0000
2016-06-08    0.0000    0.0000
2016-06-09    0.0000    0.0000
2016-06-10    0.0000    0.0000
2016-06-11    0.0000    0.0000
2016-06-12    0.0000    0.0000
2016-06-13    0.0000    0.0000
2016-06-14    0.0000    0.0000
2016-06-15    0.0000    0.0000
2016-06-16    0.0000    0.0000
2016-06-17    0.0000    0.0000
2016-06-18    0.0000    0.0000
2016-06-19    0.0000    0.0000
2016-06-20    0.0000    0.0000
2016-06-21    0.0000    0.0000
2016-06-22    0.0000    0.0000
2016-06-23    0.0000    0.0000
2016-06-24    0.0000    0.0000
2016-06-25    0.0000    0.0000
2016-06-26    0.0000    0.0000
2016-06-27    0.0000    0.0000
2016-06-28    0.0000    0.0000
2016-06-29    0.0000    0.0000
2016-06-30    0.0000    0.0000
2016-07-01    0.0000    0.0000
2016-07-02    0.0000    0.0000
2016-07-03    0.0000    0.0000
2016-07-04    0.0000    0.0000
2016-07-05    0.0000    0.0000
             ...       ...
2016-09-26    0.0000    0.0000
2016-09-27    0.0000    0.0000
2016-09-28    0.0000    0.0000
2016-09-29    0.0000    0.0000
2016-09-30    0.0000    0.0000
2016-10-01    0.0000    0.0000
2016-10-02    0.0000    0.0000
2016-10-03    0.0000    0.0000
2016-10-04    0.0000    0.0000
2016-10-05    0.0000    0.0000
2016-10-06    0.0000    0.0000
2016-10-07    0.0000    0.0000
2016-10-08    0.0000    0.0000
2016-10-09    0.0000    0.0000
2016-10-10    0.0000    0.0000
2016-10-11    0.0000    0.0000
2016-10-12    0.0000    0.0000
2016-10-13    0.0000    0.0000
2016-10-14    0.0000    0.0000
2016-10-15    0.0000    0.0000
2016-10-16    0.0000    0.0000
2016-10-17    0.0000    0.0000
2016-10-18    0.0000    0.0000
2016-10-19    0.0000    0.0000
2016-10-20    0.0000    0.0000
2016-10-21    0.0000    0.0000
2016-10-22    0.0000    0.0000
2016-10-23    0.0000    0.0000
2016-10-24  313.5954  364.4107
2016-10-25  362.4682  431.5803

[142 rows x 2 columns]

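You can also select the two rows of each date by position and subtract them, as in the snippet above. The original answer used ix for this; ix was removed in pandas 1.0, and iloc is the positional equivalent.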
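A minimal sketch of the select-and-subtract approach follows, assuming the same hypothetical two-day frame as an input. The label-based variant with xs is not part of the original answer; it is included as a variation that does not depend on the row order within each date.

```python
import pandas as pd

# Hypothetical frame indexed by (date, time); values from the question's last two dates.
idx = pd.MultiIndex.from_product(
    [pd.to_datetime(["2016-10-24", "2016-10-25"]), ["morning", "evening"]],
    names=["date", "time"],
)
full = pd.DataFrame(
    {"data1": [313.5954, 0.0, 362.4682, 0.0],
     "data2": [364.4107, 0.0, 431.5803, 0.0]},
    index=idx,
)
cols = ["data1", "data2"]

# Positional selection per date: first row (morning) minus second row (evening).
by_position = full.groupby(level=0)[cols].apply(lambda x: x.iloc[0] - x.iloc[1])

# Label-based variation: take each time slot as a cross-section and subtract.
by_label = (full.xs("morning", level="time")[cols]
            - full.xs("evening", level="time")[cols])

print(by_position)
print(by_label)
```

Both results are indexed by date only, one row per day.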
