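I have a pandas DataFrame with an irregular datetime index. I want to filter the DataFrame based on runs of consecutive observations. In other words, I only want to keep observations that belong to a run of x or more consecutive days.
Take the following example: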
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2003-04-11', '2003-04-12', '2003-04-13', '2003-04-17', '2003-05-02', '2003-05-03', '2003-05-04', '2003-07-23', '2003-07-24'])
df = pd.DataFrame(np.random.random((9, 2)), index=idx)
df
                   0         1
2003-04-11  0.954287  0.331016
2003-04-12  0.553477  0.858590
2003-04-13  0.179510  0.103970
2003-04-17  0.608664  0.746860
2003-05-02  0.691829  0.081192
2003-05-03  0.790748  0.319989
2003-05-04  0.955903  0.668918
2003-07-23  0.630201  0.297902
2003-07-24  0.692403  0.847222
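There are 3 consecutive observations from 2003-04-11 to 2003-04-13, then a single observation on 2003-04-17, another 3 consecutive observations from 2003-05-02 to 2003-05-04, and finally 2 consecutive observations on 2003-07-23 and 2003-07-24.
How can I select only the observations that are part of a run of 3 or more consecutive days? In this example, the following observations should be kept: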
                   0         1
2003-04-11  0.954287  0.331016
2003-04-12  0.553477  0.858590
2003-04-13  0.179510  0.103970
2003-05-02  0.691829  0.081192
2003-05-03  0.790748  0.319989
2003-05-04  0.955903  0.668918
Answer 0 (score: 2)
Although an answer has already been accepted, you can try a different approach:
df1 = df.loc[df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len).iloc[:, 0] >= 3]
print(df1)
                   0         1
2003-04-11  0.350339  0.904514
2003-04-12  0.903141  0.423335
2003-04-13  0.394534  0.803299
2003-05-02  0.158032  0.565684
2003-05-03  0.715311  0.772509
2003-05-04  0.136462  0.533705
Step by step:
print(~(df.index.to_series().diff() == pd.Timedelta(1, unit='d')))
#2003-04-11 True
#2003-04-12 False
#2003-04-13 False
#2003-04-17 True
#2003-05-02 True
#2003-05-03 False
#2003-05-04 False
#2003-07-23 True
#2003-07-24 False
#dtype: bool
print((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int))
#2003-04-11 1
#2003-04-12 0
#2003-04-13 0
#2003-04-17 1
#2003-05-02 1
#2003-05-03 0
#2003-05-04 0
#2003-07-23 1
#2003-07-24 0
#dtype: int32
print((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum())
#2003-04-11 1
#2003-04-12 1
#2003-04-13 1
#2003-04-17 2
#2003-05-02 3
#2003-05-03 3
#2003-05-04 3
#2003-07-23 4
#2003-07-24 4
#dtype: int32
print(df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len))
# 0 1
#2003-04-11 3 3
#2003-04-12 3 3
#2003-04-13 3 3
#2003-04-17 1 1
#2003-05-02 3 3
#2003-05-03 3 3
#2003-05-04 3 3
#2003-07-23 2 2
#2003-07-24 2 2
print(df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len).iloc[:, 0])
#2003-04-11 3
#2003-04-12 3
#2003-04-13 3
#2003-04-17 1
#2003-05-02 3
#2003-05-03 3
#2003-05-04 3
#2003-07-23 2
#2003-07-24 2
#Name: 0, dtype: float64
print(df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len).iloc[:, 0] >= 3)
#2003-04-11 True
#2003-04-12 True
#2003-04-13 True
#2003-04-17 False
#2003-05-02 True
#2003-05-03 True
#2003-05-04 True
#2003-07-23 False
#2003-07-24 False
#Name: 0, dtype: bool
print(df.loc[df.groupby((~(df.index.to_series().diff() == pd.Timedelta(1, unit='d'))).astype(int).cumsum()).transform(len).iloc[:, 0] >= 3])
# 0 1
#2003-04-11 0.120301 0.635707
#2003-04-12 0.747283 0.681601
#2003-04-13 0.118192 0.777899
#2003-05-02 0.481396 0.294547
#2003-05-03 0.619790 0.058048
#2003-05-04 0.179386 0.348843
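The steps above can also be wrapped into a small helper. This is only a sketch, not part of the original answer: the function name `keep_runs` and the parameter `min_run` are made up for illustration, and it assumes the index is a sorted `DatetimeIndex` of daily dates with no duplicates. It compares with `>=` so that runs longer than `min_run` are kept as well.

```python
import numpy as np
import pandas as pd


def keep_runs(df, min_run=3):
    """Keep rows that belong to a run of at least `min_run` consecutive days.

    Hypothetical helper; assumes a sorted DatetimeIndex without duplicates.
    """
    dates = df.index.to_series()
    # True wherever a new run starts: the gap to the previous row is not
    # exactly one day (the first row compares against NaT, so it is True)
    new_run = dates.diff() != pd.Timedelta(days=1)
    # The running count of run starts gives every row the id of its run
    run_id = new_run.cumsum()
    # Look up, for every row, the size of the run it belongs to
    run_size = run_id.map(run_id.value_counts())
    return df[(run_size >= min_run).to_numpy()]


idx = pd.DatetimeIndex(['2003-04-11', '2003-04-12', '2003-04-13', '2003-04-17',
                        '2003-05-02', '2003-05-03', '2003-05-04',
                        '2003-07-23', '2003-07-24'])
df = pd.DataFrame(np.random.random((9, 2)), index=idx)

result = keep_runs(df, min_run=3)
print(result)
```

If you prefer to stay closer to the groupby idiom, `df.groupby(run_id).filter(lambda g: len(g) >= min_run)` should select the same rows, at the cost of calling a Python function once per run.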
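Answer 1 (score: 1)
This assumes the index is sorted in ascending order. Basically, we find the rows whose label is exactly 2 days later than the label 2 rows earlier (using shift), then use a list comprehension to generate the dates of each 3-day range, sort them, and use them to index with loc: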
In [133]:
row_labels = df.index[(df.index.to_series() - df.index.to_series().shift(2)) == pd.Timedelta(2, unit='d')]
rows = [x - pd.Timedelta(n, unit='d') for n in range(0,3) for x in row_labels]
# set() drops the repeated dates that overlapping windows produce in runs longer than 3 days
rows = sorted(set(rows))
df.loc[rows]
Out[133]:
                   0         1
2003-04-11  0.352054  0.228887
2003-04-12  0.776784  0.594784
2003-04-13  0.137554  0.852900
2003-05-02  0.589869  0.574012
2003-05-03  0.061270  0.590426
2003-05-04  0.245350  0.340445
You can see the result of the initial calculation:
In [134]:
df.index[(df.index.to_series() - df.index.to_series().shift(2)) == pd.Timedelta(2, unit='d')]
Out[134]:
DatetimeIndex(['2003-04-13', '2003-05-04'], dtype='datetime64[ns]', freq=None)
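The same shift idea can be generalised to a minimum run length other than 3. The following is a rough sketch rather than the original answer's code: the variable `n` is an assumed parameter, and it relies on the index being sorted, free of duplicates, and made of plain daily dates. The list comprehension above can yield the same date more than once when a run is longer than the window, so this version collects the dates in a set before sorting.

```python
import numpy as np
import pandas as pd

n = 3  # minimum run length (assumed parameter, not in the original answer)

idx = pd.DatetimeIndex(['2003-04-11', '2003-04-12', '2003-04-13', '2003-04-17',
                        '2003-05-02', '2003-05-03', '2003-05-04',
                        '2003-07-23', '2003-07-24'])
df = pd.DataFrame(np.random.random((9, 2)), index=idx)

dates = df.index.to_series()
# A row closes a window of n consecutive days when the label n-1 rows
# earlier is exactly n-1 days before it
is_end = (dates - dates.shift(n - 1)) == pd.Timedelta(days=n - 1)
ends = df.index[is_end.to_numpy()]
# Expand every window end back into its n dates; the set removes the
# duplicates that overlapping windows create in runs longer than n
rows = sorted({end - pd.Timedelta(days=k) for end in ends for k in range(n)})
result = df.loc[rows]
print(result)
```

For the example data `ends` holds 2003-04-13 and 2003-05-04, matching the output shown above.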