Mask dataframe columns based on a datetime index

Asked: 2018-12-13 15:28:41

Tags: python pandas dask

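Very similar to this question, except that I need to take both the date and the time into account; indexer_between_time does not seem to support any datetime format that I can find.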
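To illustrate the limitation described above, here is a minimal pandas sketch (not dask, and the three-row index is made up for the example). It suggests that indexer_between_time matches on the time of day only, whereas comparing the index against full timestamps restricts the match to a single date.

```python
import pandas as pd

# Hypothetical index: the same time of day on two different dates.
idx = pd.DatetimeIndex(
    ["2017-01-01 00:01:40", "2017-01-02 00:01:40", "2017-01-02 00:05:00"]
)

# Time-of-day selection: the date part is ignored, so both days match.
positions = idx.indexer_between_time("00:01:00", "00:02:00")

# Full-timestamp comparison: only the entry on 2017-01-01 matches.
in_range = (idx >= pd.Timestamp("2017-01-01 00:01:00")) & (
    idx <= pd.Timestamp("2017-01-01 00:02:00")
)

print(positions.tolist())
print(in_range.tolist())
```

Whether the same comparison works unchanged on a dask index is assumed here rather than verified.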

I have a dask dataframe like this:

                     logger_volt        lat     lon
time                                               
2017-01-01 00:01:20      12.0112  37.150902 -98.362
2017-01-01 00:01:40      12.0113  37.150902 -98.362
2017-01-01 00:02:00      12.0057  37.150902 -98.362
2017-01-01 00:02:20      12.0113  37.150902 -98.362
2017-01-01 00:02:40      12.0058  37.150902 -98.362
2017-01-01 00:03:00      12.0113  37.150902 -98.362

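And a list of columns to mask within specific time ranges (data within those ranges is considered "bad", and None should be returned there), given as Python tuples or lists: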

[   # var       start of mask           end of mask
    ('lat', '2017-01-01 00:01:40', '2017-01-01 00:02:00'),
    ('lon', '2017-01-01 00:02:40', '2017-01-01 00:03:00'),
]

Desired result:

                     logger_volt        lat     lon
time                                               
2017-01-01 00:01:20      12.0112  37.150902 -98.362
2017-01-01 00:01:40      12.0113       None -98.362
2017-01-01 00:02:00      12.0057       None -98.362
2017-01-01 00:02:20      12.0113  37.150902 -98.362
2017-01-01 00:02:40      12.0058  37.150902    None
2017-01-01 00:03:00      12.0113  37.150902    None

Non-working code:

dqrs = [   # var       start of mask           end of mask
    ('lat', '2017-01-01 00:01:40', '2017-01-01 00:02:00'),
    ('lon', '2017-01-01 00:02:40', '2017-01-01 00:03:00'),
]
df = xarray.open_dataset('filename.cdf').to_dask_dataframe()

dqr_mask = (df == df) | df.isnull()  # create a dummy mask that's all True
for var, start, end in dqrs:
    dqr_mask |= ((df.columns == var) & (df.index >= start) & (df.index >= end))

df = df.mask(dqr_mask).compute()

Problems with other approaches:

  • Dask dataframes do not yet implement slice assignment, so something like df[start:end] = None does not work
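For comparison, the slice assignment mentioned above would look roughly like this in plain pandas, where it is supported. This only illustrates the idiom that is unavailable in dask; it is not a dask solution, and the small frame is a cut-down copy of the sample data.

```python
import numpy as np
import pandas as pd

index = pd.DatetimeIndex(
    [
        "2017-01-01 00:01:20",
        "2017-01-01 00:01:40",
        "2017-01-01 00:02:00",
        "2017-01-01 00:02:20",
    ],
    name="time",
)
df = pd.DataFrame({"lat": [37.150902] * 4, "lon": [-98.362] * 4}, index=index)

# Label-based slicing on a sorted DatetimeIndex includes both endpoints.
# NaN is used as the missing-value marker instead of None.
df.loc["2017-01-01 00:01:40":"2017-01-01 00:02:00", "lat"] = np.nan

print(df)
```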

1 Answer:

Answer 0 (score: 1)

You only need to select, inside the for loop, the column var of dqr_mask that you want to modify. Here is one way:

dqr_mask = df != df  # start from a mask that is False wherever df holds a value
for var, start, end in dqrs:
    # set column var to True where the index lies between start and end
    dqr_mask[var] |= (df.index >= start) & (df.index <= end)
# where dqr_mask is False the value of df is kept; otherwise it is set to None
df = df.mask(dqr_mask, other=None)

print(df)
                    logger_volt      lat     lon
time                                            
2017-01-01 00:01:20     12.0112  37.1509 -98.362
2017-01-01 00:01:40     12.0113     None -98.362
2017-01-01 00:02:00     12.0057     None -98.362
2017-01-01 00:02:20     12.0113  37.1509 -98.362
2017-01-01 00:02:40     12.0058  37.1509    None
2017-01-01 00:03:00     12.0113  37.1509    None
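The approach above can also be tried end to end as a self-contained sketch. This version uses plain pandas rather than dask and rebuilds the sample data by hand. The helper name apply_dqr_mask is hypothetical, and the masked cells come out as NaN (the default for DataFrame.mask) rather than None.

```python
import pandas as pd

index = pd.DatetimeIndex(
    [
        "2017-01-01 00:01:20",
        "2017-01-01 00:01:40",
        "2017-01-01 00:02:00",
        "2017-01-01 00:02:20",
        "2017-01-01 00:02:40",
        "2017-01-01 00:03:00",
    ],
    name="time",
)
df = pd.DataFrame(
    {
        "logger_volt": [12.0112, 12.0113, 12.0057, 12.0113, 12.0058, 12.0113],
        "lat": [37.150902] * 6,
        "lon": [-98.362] * 6,
    },
    index=index,
)

dqrs = [  # var, start of mask, end of mask
    ("lat", "2017-01-01 00:01:40", "2017-01-01 00:02:00"),
    ("lon", "2017-01-01 00:02:40", "2017-01-01 00:03:00"),
]


def apply_dqr_mask(frame, ranges):
    """Return a copy of frame with each (column, start, end) range set to NaN."""
    # Start from an all-False mask with the same shape and labels as frame.
    mask = pd.DataFrame(False, index=frame.index, columns=frame.columns)
    for var, start, end in ranges:
        # Rows whose timestamp falls inside the closed interval [start, end].
        in_range = (frame.index >= pd.Timestamp(start)) & (
            frame.index <= pd.Timestamp(end)
        )
        mask.loc[in_range, var] = True
    # DataFrame.mask replaces the cells where the mask is True (NaN by default).
    return frame.mask(mask)


masked = apply_dqr_mask(df, dqrs)
print(masked)
```

Building the mask from an explicit all-False frame, instead of df != df, keeps the starting point independent of any NaN values already present in the data.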