Resampling a time series with duplicate values

Time: 2019-10-28 03:51:13

Tags: python pandas

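I am trying to resample a time series that contains duplicate timestamps. I want the resampled series to have a time point every 0.1 seconds. The rows created for the new time points should be filled with NaN, and the existing rows should stay unchanged.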

import pandas as pd
import numpy as np

d1 = ({   
    'Value' : ['A','A',np.nan,np.nan,'B','B','B'],
    'Other' : ['X','X',np.nan,np.nan,'X','X',np.nan],  
    'Col' : [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],                          
    'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.4','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.8'],
    'Count' : [1,1,np.nan,5,6,7,8],
    })

df1 = pd.DataFrame(data = d1)

df1['Time'] = pd.to_datetime(df1['Time'])

df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
        .unstack()
        .asfreq('0.1S', method ='pad')
        .stack()
        .reset_index(level=1, drop=True)
        .sort_index()
        .reset_index())

Output:

                     Time Value Other  Col  Count
0 2019-08-02 09:50:10.100     A     X  NaN    1.0
1 2019-08-02 09:50:10.200     A     X  NaN    1.0
2 2019-08-02 09:50:10.300     A     X  NaN    1.0
3 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
4 2019-08-02 09:50:10.700     B     X  NaN    6.0
5 2019-08-02 09:50:10.700     B     X  NaN    7.0
6 2019-08-02 09:50:10.800     B   NaN  NaN    8.0

Expected output:

                     Time Value Other    Col  Count
0 2019-08-02 09:50:10.100     A     X    NaN    1.0
1 2019-08-02 09:50:10.200     A     X    NaN    1.0
2 2019-08-02 09:50:10.300     NaN   NaN  NaN    NaN
3 2019-08-02 09:50:10.400     NaN   NaN  NaN    NaN
4 2019-08-02 09:50:10.500     NaN   NaN  NaN    NaN
5 2019-08-02 09:50:10.600     NaN   NaN  NaN    NaN
6 2019-08-02 09:50:10.700     NaN   NaN  NaN    5.0
7 2019-08-02 09:50:10.700     B     X    NaN    6.0
8 2019-08-02 09:50:10.700     B     X    NaN    7.0
9 2019-08-02 09:50:10.800     B     NaN  NaN    8.0

2 answers:

Answer 0: (score: 1)

Try this:

df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
        .unstack()
        .asfreq('100ms', method ='pad')
        .stack()
        .reset_index(level=1, drop=True)
        .sort_index()
        .reset_index())
dr = pd.date_range(df1['Time'].iloc[0], df1['Time'].iloc[-1], freq='100ms')
df2 = pd.DataFrame({'Time': dr[~dr.isin(df1['Time'])]}, columns = df1.columns)
print(pd.concat([df1,df2]).sort_values('Time').reset_index(drop=True))

Output:

                     Time  Col  Count Other Value
0 2019-08-02 09:50:10.100  NaN    1.0     X     A
1 2019-08-02 09:50:10.200  NaN    1.0     X     A
2 2019-08-02 09:50:10.300  NaN    1.0     X     A
3 2019-08-02 09:50:10.400  NaN    NaN   NaN   NaN
4 2019-08-02 09:50:10.500  NaN    NaN   NaN   NaN
5 2019-08-02 09:50:10.600  NaN    NaN   NaN   NaN
6 2019-08-02 09:50:10.700  NaN    5.0   NaN   NaN
7 2019-08-02 09:50:10.700  NaN    6.0     X     B
8 2019-08-02 09:50:10.700  NaN    7.0     X     B
9 2019-08-02 09:50:10.800  NaN    8.0   NaN     B

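As you can see, I only added the last three lines of code. I create a new dataframe, df2, whose Time column holds the timestamps on the 100 ms grid that are not in df1, with all remaining columns set to NaN. Then I concatenate the two dataframes, sort by Time, and reset the index.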
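The gap-filling step described above can be looked at in isolation. What follows is a minimal sketch on a made-up four-row frame; the data and the names `grid`, `missing`, `filler` and `result` are assumptions for illustration, not part of the answer. It skips the unstack/asfreq/stack chain and only shows how timestamps absent from a 100 ms grid can be appended as NaN rows.

```python
import pandas as pd

# Hypothetical toy data: one gap (10.3 and 10.4 are missing) and one
# duplicated timestamp (10.5 appears twice).
df = pd.DataFrame({
    'Time': pd.to_datetime(['2019-08-02 09:50:10.1',
                            '2019-08-02 09:50:10.2',
                            '2019-08-02 09:50:10.5',
                            '2019-08-02 09:50:10.5']),
    'Value': ['A', 'A', 'B', 'C'],
})

# Full 100 ms grid between the first and last observed timestamp.
grid = pd.date_range(df['Time'].min(), df['Time'].max(), freq='100ms')

# Grid points that do not occur in the data at all.
missing = grid[~grid.isin(df['Time'])]

# A frame holding only the missing timestamps; concat fills the
# columns it lacks with NaN.
filler = pd.DataFrame({'Time': missing})

# mergesort is stable, so rows sharing a timestamp keep their order.
result = (pd.concat([df, filler])
            .sort_values('Time', kind='mergesort')
            .reset_index(drop=True))

print(result)
```

Existing rows, including the duplicated timestamp, pass through untouched here, because the grid is only used to find timestamps that are missing.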

Answer 1: (score: 0)

The problem is that stack() defaults to dropna=True. You can change that, and then filter the result with a boolean mask built from duplicated:

df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
        .unstack()
        .asfreq('0.1S', method ='pad')
        .stack(dropna=False) #change dropna to False
        .reset_index(level=1, drop=True)
        .sort_index()
        .reset_index())

print (df1[~df1.duplicated(["Value","Other","Col","Time","Count"], keep=False)|~df1['Time'].duplicated(keep='first')])

                      Time Value Other  Col  Count
0  2019-08-02 09:50:10.100     A     X  NaN    1.0
3  2019-08-02 09:50:10.200     A     X  NaN    1.0
6  2019-08-02 09:50:10.300     A     X  NaN    1.0
9  2019-08-02 09:50:10.400   NaN   NaN  NaN    NaN
12 2019-08-02 09:50:10.500   NaN   NaN  NaN    NaN
15 2019-08-02 09:50:10.600   NaN   NaN  NaN    NaN
18 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
19 2019-08-02 09:50:10.700     B     X  NaN    6.0
20 2019-08-02 09:50:10.700     B     X  NaN    7.0
21 2019-08-02 09:50:10.800     B   NaN  NaN    8.0
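To see what the boolean mask in this answer does, here is a small sketch on a hand-built frame that imitates the shape left by stack(dropna=False): every timestamp appears three times, and the unused slots are all-NaN rows. The data and the names `dup_all`, `first_of_time` and `kept` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

times = pd.to_datetime(['2019-08-02 09:50:10.1'] * 3
                       + ['2019-08-02 09:50:10.2'] * 3
                       + ['2019-08-02 09:50:10.3'] * 3)

# Hypothetical stacked result: three slots per timestamp.
df = pd.DataFrame({
    'Time': times,
    'Value': ['A', np.nan, np.nan,        # one real row, two filler rows
              np.nan, np.nan, np.nan,     # inserted grid point, all filler
              np.nan, 'B', 'B'],          # three real rows
    'Count': [1.0, np.nan, np.nan,
              np.nan, np.nan, np.nan,
              5.0, 6.0, 7.0],
})

# True for every row that has an identical twin (NaN compares equal here).
dup_all = df.duplicated(['Time', 'Value', 'Count'], keep=False)

# True for the first row of each timestamp.
first_of_time = ~df['Time'].duplicated(keep='first')

# Keep rows that are unique, plus one row per timestamp.
kept = df[~dup_all | first_of_time]

print(kept)
```

Under these assumptions the mask keeps one row for each timestamp plus every row that has no identical twin. It depends on filler rows occurring in identical groups: a timestamp with a single all-NaN filler row would keep that row, and a genuinely repeated data row would be dropped unless it is the first row for its timestamp.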