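I am trying to resample a time series that contains duplicate values. I want the resampled series to contain one time point every 0.1 seconds. For the new time points, the created rows should be filled with NaN, and the existing rows should stay unchanged.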
import pandas as pd
import numpy as np
d1 = ({
'Value' : ['A','A',np.nan,np.nan,'B','B','B'],
'Other' : ['X','X',np.nan,np.nan,'X','X',np.nan],
'Col' : [np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.4','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.7','2019-08-02 09:50:10.8'],
'Count' : [1,1,np.nan,5,6,7,8],
})
df1 = pd.DataFrame(data = d1)
df1['Time'] = pd.to_datetime(df1['Time'])
df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
          .unstack()
          .asfreq('0.1S', method='pad')
          .stack()
          .reset_index(level=1, drop=True)
          .sort_index()
          .reset_index())
Output:
                     Time Value Other  Col  Count
0 2019-08-02 09:50:10.100     A     X  NaN    1.0
1 2019-08-02 09:50:10.200     A     X  NaN    1.0
2 2019-08-02 09:50:10.300     A     X  NaN    1.0
3 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
4 2019-08-02 09:50:10.700     B     X  NaN    6.0
5 2019-08-02 09:50:10.700     B     X  NaN    7.0
6 2019-08-02 09:50:10.800     B   NaN  NaN    8.0
Expected output:
                     Time Value Other  Col  Count
0 2019-08-02 09:50:10.100     A     X  NaN    1.0
1 2019-08-02 09:50:10.200     A     X  NaN    1.0
2 2019-08-02 09:50:10.300   NaN   NaN  NaN    NaN
3 2019-08-02 09:50:10.400   NaN   NaN  NaN    NaN
4 2019-08-02 09:50:10.500   NaN   NaN  NaN    NaN
5 2019-08-02 09:50:10.600   NaN   NaN  NaN    NaN
6 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
7 2019-08-02 09:50:10.700     B     X  NaN    6.0
8 2019-08-02 09:50:10.700     B     X  NaN    7.0
9 2019-08-02 09:50:10.800     B   NaN  NaN    8.0
Answer 0 (score: 1)
Try using:
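# Same pipeline as in the question, with the frequency written as '100ms'
df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
          .unstack()
          .asfreq('100ms', method='pad')
          .stack()
          .reset_index(level=1, drop=True)
          .sort_index()
          .reset_index())
# Every timestamp on the 0.1-second grid between the first and last row
dr = pd.date_range(df1['Time'].iloc[0], df1['Time'].iloc[-1], freq='100ms')
# One row per grid timestamp missing from df1; all other columns are NaN
df2 = pd.DataFrame({'Time': dr[~dr.isin(df1['Time'])]}, columns=df1.columns)
print(pd.concat([df1, df2]).sort_values('Time').reset_index(drop=True))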
Output:
                     Time  Col  Count Other Value
0 2019-08-02 09:50:10.100  NaN    1.0     X     A
1 2019-08-02 09:50:10.200  NaN    1.0     X     A
2 2019-08-02 09:50:10.300  NaN    1.0     X     A
3 2019-08-02 09:50:10.400  NaN    NaN   NaN   NaN
4 2019-08-02 09:50:10.500  NaN    NaN   NaN   NaN
5 2019-08-02 09:50:10.600  NaN    NaN   NaN   NaN
6 2019-08-02 09:50:10.700  NaN    5.0   NaN   NaN
7 2019-08-02 09:50:10.700  NaN    6.0     X     B
8 2019-08-02 09:50:10.700  NaN    7.0     X     B
9 2019-08-02 09:50:10.800  NaN    8.0   NaN     B
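As you can see, I added the last three lines to your code. I create a new DataFrame, df2, that holds the timestamps that are not in df1, with all the remaining columns set to NaN. Finally, I concatenate the two DataFrames, sort by timestamp, and reset the index.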
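Note that in the output above the 10.300 row carries A / X / 1.0 rather than NaN, which looks like a side effect of method='pad'. If the inserted rows must all be NaN, one possible alternative (a sketch, not part of the code above) is to skip the unstack/asfreq step and left-merge the original frame onto a full 0.1-second grid. The names df, grid and result are only illustrative, and the sketch assumes every existing timestamp already falls exactly on the grid.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'Value': ['A', 'A', np.nan, np.nan, 'B', 'B', 'B'],
    'Other': ['X', 'X', np.nan, np.nan, 'X', 'X', np.nan],
    'Col': [np.nan] * 7,
    'Time': pd.to_datetime([
        '2019-08-02 09:50:10.1', '2019-08-02 09:50:10.2',
        '2019-08-02 09:50:10.4', '2019-08-02 09:50:10.7',
        '2019-08-02 09:50:10.7', '2019-08-02 09:50:10.7',
        '2019-08-02 09:50:10.8',
    ]),
    'Count': [1, 1, np.nan, 5, 6, 7, 8],
})

# Full grid of timestamps, one every 100 ms (assumes the data sits on this grid)
grid = pd.date_range(df['Time'].min(), df['Time'].max(), freq='100ms')

# Left merge: each grid timestamp picks up all of its existing rows
# (duplicates included); timestamps with no match get NaN in every column
result = pd.DataFrame({'Time': grid}).merge(df, on='Time', how='left')

print(result)
```

Because the merge never forward-fills, existing rows are left as they are and only the missing grid points (10.300, 10.500, 10.600 here) show up as all-NaN rows.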
Answer 1 (score: 0)
The problem is that stack() defaults to dropna=True. You can change that, and then apply another boolean mask built with duplicated afterwards:
df1 = (df1.set_index(['Time', df1.groupby('Time').cumcount()])
          .unstack()
          .asfreq('0.1S', method='pad')
          .stack(dropna=False)  # change dropna to False
          .reset_index(level=1, drop=True)
          .sort_index()
          .reset_index())
# Keep a row if it is not an exact copy of another row,
# or if it is the first row for its timestamp
mask = (~df1.duplicated(["Value", "Other", "Col", "Time", "Count"], keep=False)
        | ~df1['Time'].duplicated(keep='first'))
print(df1[mask])
                      Time Value Other  Col  Count
 0 2019-08-02 09:50:10.100     A     X  NaN    1.0
 3 2019-08-02 09:50:10.200     A     X  NaN    1.0
 6 2019-08-02 09:50:10.300     A     X  NaN    1.0
 9 2019-08-02 09:50:10.400   NaN   NaN  NaN    NaN
12 2019-08-02 09:50:10.500   NaN   NaN  NaN    NaN
15 2019-08-02 09:50:10.600   NaN   NaN  NaN    NaN
18 2019-08-02 09:50:10.700   NaN   NaN  NaN    5.0
19 2019-08-02 09:50:10.700     B     X  NaN    6.0
20 2019-08-02 09:50:10.700     B     X  NaN    7.0
21 2019-08-02 09:50:10.800     B   NaN  NaN    8.0
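To see how the two duplicated calls in the mask combine, here is a small toy sketch. The frame demo and its string timestamps 't1' / 't2' are made up for illustration and are not the question's data. It relies on the fact that duplicated treats NaN values as equal to each other.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: 't1' has three identical all-NaN padding rows,
# 't2' has two distinct real rows sharing one timestamp
demo = pd.DataFrame({
    'Time': ['t1', 't1', 't1', 't2', 't2'],
    'Count': [np.nan, np.nan, np.nan, 5.0, 6.0],
})

# True for every row that has an exact copy elsewhere (NaN counts as equal)
has_copy = demo.duplicated(['Time', 'Count'], keep=False)

# True only for the first row of each timestamp
first_per_time = ~demo['Time'].duplicated(keep='first')

# Keep rows that are unique, plus the first row of each timestamp
kept = demo[~has_copy | first_per_time]

print(kept.index.tolist())  # [0, 3, 4]
```

The repeated padding rows collapse to a single row per timestamp, while distinct rows that share a timestamp all survive.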