Detect consecutive timestamps where all rows have NaN values using pandas

Time: 2018-10-11 15:27:03

Tags: python-3.x pandas

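In a DataFrame, I want to detect the start and end (Datetime) of each set of consecutive rows in which all values are NaN.

What is the best way to store the result as an array of tuples, where each tuple holds the start and end datetime of one run of NaN rows?

For example, with the DataFrame below, the tuples should look like this: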

missing_datetimes = [('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
                     ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
                     ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]

Example DataFrame:

+------------+---------------------+------------+------------+
| geo_id     | Datetime            |  Variable1 |  Variable2 |
+------------+---------------------+------------+------------+
| 1          | 2018-10-10 18:00:00 |     20     |     10     |
| 2          | 2018-10-10 18:00:00 |     22     |     10     |
| 1          | 2018-10-10 19:00:00 |     20     |     nan    |
| 2          | 2018-10-10 19:00:00 |     21     |     nan    |
| 1          | 2018-10-10 20:00:00 |     30     |     nan    |
| 2          | 2018-10-10 20:00:00 |     30     |     nan    |
| 1          | 2018-10-10 21:00:00 |     nan    |     5      |
| 2          | 2018-10-10 21:00:00 |     nan    |     5      |
| 1          | 2018-10-10 22:00:00 |     nan    |     nan    |
| 1          | 2018-10-10 23:00:00 |     nan    |     nan    |
| 1          | 2018-10-11 00:00:00 |     nan    |     nan    |
| 1          | 2018-10-11 01:00:00 |     5      |     2      |
| 1          | 2018-10-11 02:00:00 |     nan    |     nan    |
| 1          | 2018-10-11 03:00:00 |     2      |     1      |
| 1          | 2018-10-11 04:00:00 |     nan    |     nan    |
+------------+---------------------+------------+------------+

Update: what if some datetimes are duplicated?

1 answer:

Answer 0: (score: 2)

You may need to use groupby with a condition:
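# True for rows where every value column is NaN (geo_id and Datetime are never NaN, so they are excluded)
s = df.drop(columns=['geo_id', 'Datetime']).isnull().all(axis=1)

df.loc[s, 'Datetime'].groupby((~s).cumsum()[s]).agg(['first', 'last']).apply(tuple, axis=1).tolist()
# find the all-NaN rows; (~s).cumsum() only increases on non-NaN rows, so consecutive all-NaN rows fall into one group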

Out[89]: 
[('2018-10-10 22:00:00', '2018-10-11 00:00:00'),
 ('2018-10-11 02:00:00', '2018-10-11 02:00:00'),
 ('2018-10-11 04:00:00', '2018-10-11 04:00:00')]
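For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same groupby/cumsum idea on the sample data from the question. It is not the answerer's code: the helper name `nan_runs` and its parameters are hypothetical. It also makes one assumption about the unanswered update on duplicated datetimes: a timestamp counts as missing only if every row sharing it is NaN in all value columns.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Sample data copied from the question's table.
df = pd.DataFrame({
    "geo_id": [1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1],
    "Datetime": pd.to_datetime(
        ["2018-10-10 18:00:00"] * 2 + ["2018-10-10 19:00:00"] * 2
        + ["2018-10-10 20:00:00"] * 2 + ["2018-10-10 21:00:00"] * 2
        + ["2018-10-10 22:00:00", "2018-10-10 23:00:00", "2018-10-11 00:00:00",
           "2018-10-11 01:00:00", "2018-10-11 02:00:00", "2018-10-11 03:00:00",
           "2018-10-11 04:00:00"]
    ),
    "Variable1": [20, 22, 20, 21, 30, 30, nan, nan, nan, nan, nan, 5, nan, 2, nan],
    "Variable2": [10, 10, nan, nan, nan, nan, 5, 5, nan, nan, nan, 2, nan, 1, nan],
})


def nan_runs(df, time_col="Datetime", value_cols=("Variable1", "Variable2")):
    """Hypothetical helper: (start, end) tuples for runs of all-NaN timestamps."""
    # Row-level flag: True where every value column is NaN.
    row_missing = df[list(value_cols)].isnull().all(axis=1)
    # Assumed policy for duplicates: a timestamp is missing only if ALL of
    # its rows are missing. The result is indexed by unique, sorted timestamps.
    ts_missing = row_missing.groupby(df[time_col]).all().sort_index()

    # (~ts_missing).cumsum() increases at each non-missing timestamp, so
    # consecutive missing timestamps end up with the same group id.
    run_id = (~ts_missing).cumsum()[ts_missing]
    times = ts_missing[ts_missing].index.to_series()
    runs = times.groupby(run_id).agg(["first", "last"])
    return [(str(start), str(end)) for start, end in zip(runs["first"], runs["last"])]


missing_datetimes = nan_runs(df)
print(missing_datetimes)
```

On the sample data this should give the three tuples requested in the question. If a different rule for duplicates is wanted (for example, treating a timestamp as missing when any of its rows is all-NaN), replacing `.all()` with `.any()` in the per-timestamp step would be one way to express it.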