How can I use pandas to identify approximately (threshold-defined) consecutive non-null data?

Time: 2015-09-11 10:12:54

Tags: python numpy pandas scipy time-series

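I want to extract rainfall events from a rainfall time series while allowing up to X dry hours (a parameter) within a single event. By rainfall event I mean approximately continuous rainfall (RF > 0) that contains at most X consecutive dry hours (RF = 0).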

I would rather not do this with iterators and counters; I am looking for a pandas or numpy/scipy tool that I can rely on.

Below is a sample of my data frame. RF is the raw rainfall and RFfill is RF.interpolate(), used to fill nodata values. evtId is a field created to store each event's unique ID.

                    TS   RF  RFfill  evtId
0  1997-11-27 14:00:00  0.3     0.3    NaN
1  1997-11-27 15:00:00  1.1     1.1    NaN
2  1997-11-27 16:00:00  0.2     0.2    NaN
3  1997-11-27 17:00:00  0.0     0.0    NaN
4  1997-11-27 18:00:00  0.0     0.0    NaN
5  1997-11-27 19:00:00  1.1     1.1    NaN
6  1997-11-27 20:00:00  0.6     0.6    NaN
7  1997-11-27 21:00:00  0.0     0.0    NaN
8  1997-11-27 22:00:00  0.0     0.0    NaN
9  1997-11-27 23:00:00  0.0     0.0    NaN
10 1997-11-28 00:00:00  0.0     0.0    NaN
11 1997-11-28 01:00:00  0.0     0.0    NaN
12 1997-11-28 02:00:00  0.0     0.0    NaN
13 1997-11-28 03:00:00  0.0     0.0    NaN
14 1997-11-28 04:00:00  0.0     0.0    NaN
15 1997-11-28 05:00:00  0.0     0.0    NaN
16 1997-11-28 06:00:00  0.0     0.0    NaN
17 1997-11-28 07:00:00  0.0     0.0    NaN
18 1997-11-28 08:00:00  0.0     0.0    NaN
19 1997-11-28 09:00:00  0.8     0.8    NaN
20 1997-11-28 10:00:00  1.1     1.1    NaN
21 1997-11-28 11:00:00  2.3     2.3    NaN
22 1997-11-28 12:00:00  1.4     1.4    NaN
23 1997-11-28 13:00:00  0.4     0.4    NaN
24 1997-11-28 14:00:00  0.2     0.2    NaN
25 1997-11-28 15:00:00  0.0     0.0    NaN
26 1997-11-28 16:00:00  0.0     0.0    NaN
27 1997-11-28 17:00:00  0.0     0.0    NaN
28 1997-11-28 18:00:00  0.0     0.0    NaN
29 1997-11-28 19:00:00  0.0     0.0    NaN
30 1997-11-28 20:00:00  0.0     0.0    NaN
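
For anyone who wants to reproduce the frame above, a sketch along these lines should do. The RF values are copied from the table; building TS with pd.date_range and a fixed one-hour step is an assumption about how the timestamps were generated.

```python
import numpy as np
import pandas as pd

# Rainfall values copied from the table above (31 hourly rows).
rf = ([0.3, 1.1, 0.2, 0.0, 0.0, 1.1, 0.6] + [0.0] * 12
      + [0.8, 1.1, 2.3, 1.4, 0.4, 0.2] + [0.0] * 6)

df = pd.DataFrame({
    # Assumed: a regular hourly series starting at the first timestamp shown.
    'TS': pd.date_range('1997-11-27 14:00:00', periods=len(rf),
                        freq=pd.Timedelta(hours=1)),
    'RF': rf,
})
# interpolate() fills NaN gaps; this sample has none, so RFfill equals RF.
df['RFfill'] = df['RF'].interpolate()
# Placeholder column for the event ID.
df['evtId'] = np.nan
print(df.head())
```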

Below is the expected output for an allowed dry period of 5 hours:

                    TS   RF  RFfill  evtId
0  1997-11-27 14:00:00  0.3     0.3    0
1  1997-11-27 15:00:00  1.1     1.1    0
2  1997-11-27 16:00:00  0.2     0.2    0
3  1997-11-27 17:00:00  0.0     0.0    0
4  1997-11-27 18:00:00  0.0     0.0    0
5  1997-11-27 19:00:00  1.1     1.1    0
6  1997-11-27 20:00:00  0.6     0.6    0
7  1997-11-27 21:00:00  0.0     0.0    NaN
8  1997-11-27 22:00:00  0.0     0.0    NaN
9  1997-11-27 23:00:00  0.0     0.0    NaN
10 1997-11-28 00:00:00  0.0     0.0    NaN
11 1997-11-28 01:00:00  0.0     0.0    NaN
12 1997-11-28 02:00:00  0.0     0.0    NaN
13 1997-11-28 03:00:00  0.0     0.0    NaN
14 1997-11-28 04:00:00  0.0     0.0    NaN
15 1997-11-28 05:00:00  0.0     0.0    NaN
16 1997-11-28 06:00:00  0.0     0.0    NaN
17 1997-11-28 07:00:00  0.0     0.0    NaN
18 1997-11-28 08:00:00  0.0     0.0    NaN
19 1997-11-28 09:00:00  0.8     0.8    1
20 1997-11-28 10:00:00  1.1     1.1    1
21 1997-11-28 11:00:00  2.3     2.3    1
22 1997-11-28 12:00:00  1.4     1.4    1
23 1997-11-28 13:00:00  0.4     0.4    1
24 1997-11-28 14:00:00  0.2     0.2    1
25 1997-11-28 15:00:00  0.0     0.0    NaN
26 1997-11-28 16:00:00  0.0     0.0    NaN
27 1997-11-28 17:00:00  0.0     0.0    NaN
28 1997-11-28 18:00:00  0.0     0.0    NaN
29 1997-11-28 19:00:00  0.0     0.0    NaN
30 1997-11-28 20:00:00  0.0     0.0    NaN

Any ideas that could help me achieve this?

1 answer:

Answer 0 (score: 5)

import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

df = pd.DataFrame({'RF': [ 0.3,  1.1,  0.2,  0. ,  0. ,  0. ,  0. ,  0. ,  
                           1.1,  0.6,  0. , 0. ,  0. ,  0. ,  0. ,  0. ,  
                           0.8,  1.1,  2.3,  1.4,  0.4,  0.2, 0. ,  0. ,  
                           0. ,  0. ,  0. ,  0. ]})

consecutive = 5
mask = df['RF'] > 0
df['mask'] = mask
df['dilation'] = ndimage.binary_dilation(mask, structure=[1]*(consecutive+1))
df['erosion'] = ndimage.binary_erosion(df['dilation'], 
    structure=[1]*(consecutive+1), border_value=1)
df['labeled'], nobjs = ndimage.label(df['erosion'])
df['evtId'] = np.where(df['labeled'] > 0, df['labeled']-1, np.nan)
print(df[['RF', 'evtId']])

yields

#      RF  evtId
# 0   0.3      0
# 1   1.1      0
# 2   0.2      0
# 3   0.0      0
# 4   0.0      0
# 5   0.0      0
# 6   0.0      0
# 7   0.0      0
# 8   1.1      0
# 9   0.6      0
# 10  0.0    NaN
# 11  0.0    NaN
# 12  0.0    NaN
# 13  0.0    NaN
# 14  0.0    NaN
# 15  0.0    NaN
# 16  0.8      1
# 17  1.1      1
# 18  2.3      1
# 19  1.4      1
# 20  0.4      1
# 21  0.2      1
# 22  0.0    NaN
# 23  0.0    NaN
# 24  0.0    NaN
# 25  0.0    NaN
# 26  0.0    NaN
# 27  0.0    NaN
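
The same steps can be wrapped in a small function so that the allowed dry period becomes a parameter. This is only a sketch: the name label_events and the argument max_dry are made up for illustration, and the input is assumed to contain no NaN (for example the RFfill column). It is applied here to the sample series from the question.

```python
import numpy as np
import pandas as pd
import scipy.ndimage as ndimage

def label_events(rf, max_dry):
    """Hypothetical helper: event IDs (0, 1, ...) for wet spells separated
    by at most `max_dry` dry rows; rows outside any event get NaN."""
    mask = np.asarray(rf) > 0
    structure = np.ones(max_dry + 1, dtype=bool)
    # Dilation followed by erosion (a closing) bridges the short dry gaps.
    dilated = ndimage.binary_dilation(mask, structure=structure)
    closed = ndimage.binary_erosion(dilated, structure=structure,
                                    border_value=1)
    # label() numbers the runs from 1; 0 means "no event".
    labeled, _ = ndimage.label(closed)
    return np.where(labeled > 0, labeled - 1, np.nan)

# Sample series from the question, with 5 dry hours allowed.
rf = ([0.3, 1.1, 0.2, 0.0, 0.0, 1.1, 0.6] + [0.0] * 12
      + [0.8, 1.1, 2.3, 1.4, 0.4, 0.2] + [0.0] * 6)
evt = pd.Series(label_events(rf, max_dry=5), name='evtId')
# Number of rows per event (NaN rows are not counted).
print(evt.value_counts(sort=False))
```

On that series this gives event 0 for rows 0 to 6 and event 1 for rows 19 to 24, which matches the expected output in the question.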

Explanation: first prepare a boolean mask that is True where df['RF'] > 0:

mask = (df['RF'] > 0)
df['mask'] = mask
#      RF   mask
# 0   0.3   True
# 1   1.1   True
# 2   0.2   True
# 3   0.0  False
# 4   0.0  False
# 5   0.0  False
# 6   0.0  False
# 7   0.0  False
# 8   1.1   True
# 9   0.6   True
# ...
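
Before bridging anything, it can help to see how long each dry run in the mask actually is, since only runs of at most `consecutive` rows should be merged. A minimal pandas-only sketch (the names run_id and dry_len are mine, not part of the original code):

```python
import pandas as pd

rf = pd.Series([0.3, 1.1, 0.2, 0, 0, 0, 0, 0, 1.1, 0.6, 0, 0, 0, 0, 0, 0,
                0.8, 1.1, 2.3, 1.4, 0.4, 0.2, 0, 0, 0, 0, 0, 0])
mask = rf > 0

# A new run starts wherever the mask differs from the previous row.
run_id = mask.ne(mask.shift()).cumsum()

# Count the dry rows in each run, then keep only the dry runs.
dry_len = (~mask).groupby(run_id).sum()
dry_len = dry_len[dry_len > 0]
print(dry_len.tolist())  # [5, 6, 6]
```

With consecutive = 5, only the first dry run (5 rows) is short enough to be absorbed into an event.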

Next, dilate the mask so that islands of Trues (rainy hours) separated by 5 or fewer Falses (non-rainy hours) are joined together:

df['dilation'] = ndimage.binary_dilation(mask, structure=[1]*(consecutive+1))
#      RF   mask dilation
# 0   0.3   True     True
# 1   1.1   True     True
# 2   0.2   True     True
# 3   0.0  False     True   <--, 
# 4   0.0  False     True      |
# 5   0.0  False     True      |  dilation filled in 5 non-rainy rows
# 6   0.0  False     True      |
# 7   0.0  False     True   <--'
# 8   1.1   True     True
# 9   0.6   True     True
# 10  0.0  False     True   <-- But the `True`s extend a bit too far
# 11  0.0  False     True   <--
# 12  0.0  False    False
# 13  0.0  False     True
# 14  0.0  False     True
# 15  0.0  False     True
# 16  0.8   True     True
# 17  1.1   True     True
# 18  2.3   True     True
# 19  1.4   True     True
# 20  0.4   True     True
# 21  0.2   True     True
# 22  0.0  False     True
# 23  0.0  False     True
# 24  0.0  False    False
# 25  0.0  False    False
# 26  0.0  False    False
# 27  0.0  False    False

Next, use binary erosion to remove the Trues that extended too far:

df['erosion'] = ndimage.binary_erosion(df['dilation'], structure=[1]*(consecutive+1), 
                                       border_value=1)
#      RF   mask dilation erosion
# 0   0.3   True     True    True
# 1   1.1   True     True    True
# 2   0.2   True     True    True
# 3   0.0  False     True    True
# 4   0.0  False     True    True
# 5   0.0  False     True    True
# 6   0.0  False     True    True
# 7   0.0  False     True    True
# 8   1.1   True     True    True
# 9   0.6   True     True    True
# 10  0.0  False     True   False  <--,
# 11  0.0  False     True   False     |
# 12  0.0  False    False   False     | The Falses have been expanded
# 13  0.0  False     True   False     | (The Trues eroded)
# 14  0.0  False     True   False     |
# 15  0.0  False     True   False  <--'
# 16  0.8   True     True    True
# 17  1.1   True     True    True
# 18  2.3   True     True    True
# 19  1.4   True     True    True
# 20  0.4   True     True    True
# 21  0.2   True     True    True
# 22  0.0  False     True   False
# 23  0.0  False     True   False
# 24  0.0  False    False   False
# 25  0.0  False    False   False
# 26  0.0  False    False   False
# 27  0.0  False    False   False

Now that True marks a "rainfall event", we can use ndimage.label to assign a unique number to each rainfall event:

df['labeled'], nobjs = ndimage.label(df['erosion'])
#      RF   mask dilation erosion  labeled
# 0   0.3   True     True    True        1
# 1   1.1   True     True    True        1
# 2   0.2   True     True    True        1
# 3   0.0  False     True    True        1
# 4   0.0  False     True    True        1
# 5   0.0  False     True    True        1
# 6   0.0  False     True    True        1
# 7   0.0  False     True    True        1
# 8   1.1   True     True    True        1
# 9   0.6   True     True    True        1
# 10  0.0  False     True   False        0
# 11  0.0  False     True   False        0
# 12  0.0  False    False   False        0
# 13  0.0  False     True   False        0
# 14  0.0  False     True   False        0
# 15  0.0  False     True   False        0
# 16  0.8   True     True    True        2
# 17  1.1   True     True    True        2
# 18  2.3   True     True    True        2
# 19  1.4   True     True    True        2
# 20  0.4   True     True    True        2
# 21  0.2   True     True    True        2
# 22  0.0  False     True   False        0
# 23  0.0  False     True   False        0
# 24  0.0  False    False   False        0
# 25  0.0  False    False   False        0
# 26  0.0  False    False   False        0
# 27  0.0  False    False   False        0
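
As a quick illustration of what ndimage.label returns on a 1-D boolean array, here is a toy example (the flags array is made up): each run of Trues gets its own integer starting at 1, and 0 is reserved for the background.

```python
import numpy as np
import scipy.ndimage as ndimage

# Made-up flags: three separate runs of True.
flags = np.array([1, 1, 0, 0, 1, 0, 1, 1, 1], dtype=bool)

# label() returns the labeled array and the number of runs found.
labels, n_runs = ndimage.label(flags)
print(labels)  # [1 1 0 0 2 0 3 3 3]
print(n_runs)  # 3
```

Because the numbering starts at 1, one has to be subtracted to get event IDs that start at 0.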

Finally, use np.where to subtract one from the label wherever df['labeled'] > 0, and assign np.nan otherwise:

df['evtId'] = np.where(df['labeled'] > 0, df['labeled']-1, np.nan)
#      RF   mask dilation erosion  labeled  evtId
# 0   0.3   True     True    True        1      0
# 1   1.1   True     True    True        1      0
# 2   0.2   True     True    True        1      0
# 3   0.0  False     True    True        1      0
# 4   0.0  False     True    True        1      0
# 5   0.0  False     True    True        1      0
# 6   0.0  False     True    True        1      0
# 7   0.0  False     True    True        1      0
# 8   1.1   True     True    True        1      0
# 9   0.6   True     True    True        1      0
# 10  0.0  False     True   False        0    NaN
# 11  0.0  False     True   False        0    NaN
# 12  0.0  False    False   False        0    NaN
# 13  0.0  False     True   False        0    NaN
# 14  0.0  False     True   False        0    NaN
# 15  0.0  False     True   False        0    NaN
# 16  0.8   True     True    True        2      1
# 17  1.1   True     True    True        2      1
# 18  2.3   True     True    True        2      1
# 19  1.4   True     True    True        2      1
# 20  0.4   True     True    True        2      1
# 21  0.2   True     True    True        2      1
# 22  0.0  False     True   False        0    NaN
# 23  0.0  False     True   False        0    NaN
# 24  0.0  False    False   False        0    NaN
# 25  0.0  False    False   False        0    NaN
# 26  0.0  False    False   False        0    NaN
# 27  0.0  False    False   False        0    NaN
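
One side effect of np.nan is that evtId ends up as a float column. If integer IDs are preferred, pandas' nullable Int64 dtype is one option. A small sketch with made-up labels (this assumes a pandas version that supports the nullable integer dtype):

```python
import numpy as np
import pandas as pd

labeled = np.array([1, 1, 0, 0, 2, 2, 0])  # made-up labels for illustration

# Same expression as in the answer: NaN forces a float64 result.
evt_float = pd.Series(np.where(labeled > 0, labeled - 1, np.nan))

# Alternative: mask the background rows, then cast to nullable integers.
evt_int = pd.Series(labeled - 1).where(labeled > 0).astype('Int64')

print(evt_float.dtype)  # float64
print(evt_int.dtype)    # Int64
```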

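Note that a dilation followed by an erosion is called a closing. The reason I use ndimage.binary_dilation and ndimage.binary_erosion instead of simply calling ndimage.binary_closing is that I need to set border_value=1 during the erosion to prevent the edges from being eroded. Compare df['erosion'] with

ndimage.binary_closing(mask, structure=[1]*(consecutive+1))

and you will see the difference.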
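
A short way to run that comparison is sketched below, using the same data as above. The variable names eroded and closed are only for illustration.

```python
import numpy as np
import scipy.ndimage as ndimage

rf = np.array([0.3, 1.1, 0.2, 0, 0, 0, 0, 0, 1.1, 0.6, 0, 0, 0, 0, 0, 0,
               0.8, 1.1, 2.3, 1.4, 0.4, 0.2, 0, 0, 0, 0, 0, 0])
consecutive = 5
mask = rf > 0
structure = np.ones(consecutive + 1, dtype=bool)

# Two-step version: the area outside the array counts as True when eroding.
eroded = ndimage.binary_erosion(
    ndimage.binary_dilation(mask, structure=structure),
    structure=structure, border_value=1)

# One-call version: the erosion step uses the default border_value of 0.
closed = ndimage.binary_closing(mask, structure=structure)

# Positions where the two results disagree.
print(np.flatnonzero(eroded != closed))
```

On this data the disagreement shows up at the start of the series: the first row is rainy and stays True in the two-step version, but binary_closing erodes it away.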