Selecting rows of a pandas DataFrame by index

Time: 2016-02-11 19:39:01

Tags: python pandas dataframe

I have a DataFrame from which I want to remove certain recurring rows:

import numpy as np
import pandas as pd
nrows = 144    
df = pd.DataFrame(np.random.rand(nrows,), pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h'), columns=['A'])

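The DataFrame is indexed by time and receives new data every two hours indefinitely, but I am showing only a subset for brevity. I want to remove one row every 72 hours, starting on Monday at 08:00, to coincide with an external event that alters the data. For this snapshot, that means dropping the rows indexed 2016-02-08 08:00, 2016-02-11 08:00, and so on in 3-day steps.

Is there a simple way to do this?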

1 answer:

Answer 0: (score: 0)

IIUC, you can do this:

In [18]:    
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]
start

Out[18]:
Timestamp('2016-02-08 08:00:00')

In [45]:
df.loc[df.index.difference(pd.date_range(start, end=df.index[-1], freq='3D'))]

Out[45]:
                            A
2016-02-08 00:00:00  0.323742
2016-02-08 02:00:00  0.962252
2016-02-08 04:00:00  0.706537
2016-02-08 06:00:00  0.561446
2016-02-08 10:00:00  0.225042
2016-02-08 12:00:00  0.746258
2016-02-08 14:00:00  0.167950
2016-02-08 16:00:00  0.199958
2016-02-08 18:00:00  0.808286
2016-02-08 20:00:00  0.288797
2016-02-08 22:00:00  0.508109
2016-02-09 00:00:00  0.980772
2016-02-09 02:00:00  0.995731
2016-02-09 04:00:00  0.742751
2016-02-09 06:00:00  0.392247
2016-02-09 08:00:00  0.460511
2016-02-09 10:00:00  0.083660
2016-02-09 12:00:00  0.273620
2016-02-09 14:00:00  0.791506
2016-02-09 16:00:00  0.440630
2016-02-09 18:00:00  0.326418
2016-02-09 20:00:00  0.790780
2016-02-09 22:00:00  0.521131
2016-02-10 00:00:00  0.219315
2016-02-10 02:00:00  0.016625
2016-02-10 04:00:00  0.958566
2016-02-10 06:00:00  0.405643
2016-02-10 08:00:00  0.958025
2016-02-10 10:00:00  0.786663
2016-02-10 12:00:00  0.589064
...                       ...
2016-02-17 12:00:00  0.360848
2016-02-17 14:00:00  0.757499
2016-02-17 16:00:00  0.391574
2016-02-17 18:00:00  0.062812
2016-02-17 20:00:00  0.308282
2016-02-17 22:00:00  0.251520
2016-02-18 00:00:00  0.832871
2016-02-18 02:00:00  0.387108
2016-02-18 04:00:00  0.070969
2016-02-18 06:00:00  0.298831
2016-02-18 08:00:00  0.878526
2016-02-18 10:00:00  0.979233
2016-02-18 12:00:00  0.386620
2016-02-18 14:00:00  0.420962
2016-02-18 16:00:00  0.238879
2016-02-18 18:00:00  0.124069
2016-02-18 20:00:00  0.985828
2016-02-18 22:00:00  0.585278
2016-02-19 00:00:00  0.409226
2016-02-19 02:00:00  0.093945
2016-02-19 04:00:00  0.389450
2016-02-19 06:00:00  0.378091
2016-02-19 08:00:00  0.874232
2016-02-19 10:00:00  0.527629
2016-02-19 12:00:00  0.490236
2016-02-19 14:00:00  0.509008
2016-02-19 16:00:00  0.097061
2016-02-19 18:00:00  0.111626
2016-02-19 20:00:00  0.877099
2016-02-19 22:00:00  0.796201

[140 rows x 1 columns]

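So this determines the start of the range by comparing dayofweek and hour against the index and taking the first matching index value. It then uses date_range to generate the timestamps to remove (one every 3 days from start to the last index value), calls difference on the DataFrame's index to exclude those timestamps, and passes the resulting index to loc.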
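
For reference, the steps above can be put together as a self-contained script. This is only a sketch under the question's setup (a regular 2-hour index starting on Monday 2016-02-08). The names to_drop, kept, and kept_mask are illustrative and do not appear in the original answer, and the isin variant is an alternative added here, not something the answer proposed.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question: 144 rows, one every 2 hours.
nrows = 144
index = pd.date_range('2016-02-08 00:00:00', periods=nrows, freq='2h')
df = pd.DataFrame({'A': np.random.rand(nrows)}, index=index)

# First index value that falls on a Monday (dayofweek == 0) at 08:00.
start = df.index[(df.index.dayofweek == 0) & (df.index.hour == 8)][0]

# Timestamps to remove: every 3 days from start up to the last index value.
to_drop = pd.date_range(start, end=df.index[-1], freq='3D')

# Approach from the answer: set difference on the index, then select with loc.
kept = df.loc[df.index.difference(to_drop)]

# Alternative (an assumption, not from the answer): a boolean mask with isin.
# Unlike difference, this does not sort or de-duplicate the index.
kept_mask = df[~df.index.isin(to_drop)]

print(len(df), len(to_drop), len(kept))
```

With this sample data, both variants select the same 140 rows. Index.difference returns a sorted index of unique values, so the mask version may be the safer choice if the real index can contain duplicate timestamps or is not sorted.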