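This is a follow-up to an earlier question (Patch over missing rows in CSV file in Python).
The answer given there works well when the data increments regularly, but some of the new data looks more like this: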
time,a,b,c,d,e
"2016-12-28 00:00:01","4","2","2","8","0"
"2016-12-28 00:01:01","4","2","2","8","0"
"2016-12-28 00:02:01","4","2","2","8","0"
"2016-12-28 00:03:01","4","2","2","8","0"
"2016-12-28 00:05:01","4","2","2","8","0"
"2016-12-28 00:06:01","4","2","2","8","0"
"2016-12-28 00:07:01","4","2","1","7","1"
"2016-12-28 00:08:01","4","2","2","7","0"
"2016-12-28 00:09:02","4","2","1","7","1"
"2016-12-28 00:10:02","4","2","2","7","0"
"2016-12-28 00:11:02","3","2","1","7","1"
"2016-12-28 00:12:05","3","2","1","7","1"
"2016-12-28 00:13:02","3","2","1","7","1"
"2016-12-28 00:18:02","3","2","2","7","0"
"2016-12-28 00:19:05","3","2","1","7","1"
"2016-12-28 00:20:02","3","2","2","7","0"
"2016-12-28 00:21:02","3","2","1","7","1"
"2016-12-28 00:22:03","3","2","1","7","1"
"2016-12-28 00:23:02","3","2","1","7","1"
"2016-12-28 00:24:02","3","2","1","7","1"
"2016-12-28 00:25:05","3","2","1","7","1"
"2016-12-28 00:26:02","3","2","1","7","1"
"2016-12-28 00:27:03","3","2","1","7","1"
"2016-12-28 00:28:01","3","2","1","7","1"
"2016-12-28 00:29:02","3","2","1","7","1"
"2016-12-28 00:30:06","3","2","1","7","1"
Using the previously given solution produces the result below: wherever the index does not match exactly, the existing data is wiped out.
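import pandas as pd

z = pd.read_csv('smalldata.csv')
# Drop rows that repeat a timestamp, then parse the strings as datetimes.
z = z[~z.time.duplicated()]
z['time'] = pd.to_datetime(z['time'])
# Build an exact one-minute grid from the first to the last timestamp.
# reindex only keeps rows whose timestamp equals a grid point exactly.
x = z.set_index('time').reindex(pd.date_range(z['time'].min(), z['time'].max(), freq="1min"))
print(x)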
Output
2016-12-28 00:00:01 4.0 2.0 2.0 8.0 0.0
2016-12-28 00:01:01 4.0 2.0 2.0 8.0 0.0
2016-12-28 00:02:01 4.0 2.0 2.0 8.0 0.0
2016-12-28 00:03:01 4.0 2.0 2.0 8.0 0.0
2016-12-28 00:04:01 NaN NaN NaN NaN NaN
2016-12-28 00:05:01 4.0 2.0 2.0 8.0 0.0
2016-12-28 00:06:01 4.0 2.0 2.0 8.0 0.0
2016-12-28 00:07:01 4.0 2.0 1.0 7.0 1.0
2016-12-28 00:08:01 4.0 2.0 2.0 7.0 0.0
2016-12-28 00:09:01 NaN NaN NaN NaN NaN
2016-12-28 00:10:01 NaN NaN NaN NaN NaN
2016-12-28 00:11:01 NaN NaN NaN NaN NaN
2016-12-28 00:12:01 NaN NaN NaN NaN NaN
2016-12-28 00:13:01 NaN NaN NaN NaN NaN
2016-12-28 00:14:01 NaN NaN NaN NaN NaN
2016-12-28 00:15:01 NaN NaN NaN NaN NaN
2016-12-28 00:16:01 NaN NaN NaN NaN NaN
2016-12-28 00:17:01 NaN NaN NaN NaN NaN
2016-12-28 00:18:01 NaN NaN NaN NaN NaN
2016-12-28 00:19:01 NaN NaN NaN NaN NaN
2016-12-28 00:20:01 NaN NaN NaN NaN NaN
2016-12-28 00:21:01 NaN NaN NaN NaN NaN
2016-12-28 00:22:01 NaN NaN NaN NaN NaN
2016-12-28 00:23:01 NaN NaN NaN NaN NaN
2016-12-28 00:24:01 NaN NaN NaN NaN NaN
2016-12-28 00:25:01 NaN NaN NaN NaN NaN
2016-12-28 00:26:01 NaN NaN NaN NaN NaN
2016-12-28 00:27:01 NaN NaN NaN NaN NaN
2016-12-28 00:28:01 NaN NaN NaN NaN NaN
2016-12-28 00:29:01 NaN NaN NaN NaN NaN
2016-12-28 00:30:01 NaN NaN NaN NaN NaN
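I could ffill() this, but then I would just be copying the last valid value down forever instead of filling in the missing minutes, which is what I actually want.
I have tried setting the reindex freq to a range that covers the seconds part of the timestamps, without success. I also looked at the tolerance option, but could not find any documentation for it, and my experiments with it did not turn out to be useful.
Is there a way to reindex a dataframe when the index is not monotonically increasing?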
Answer 0 (score: 1)
Solution 1
Use asfreq
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
df.asfreq('min', 'nearest')
a b c d e
time
2016-12-28 00:00:01 4 2 2 8 0
2016-12-28 00:01:01 4 2 2 8 0
2016-12-28 00:02:01 4 2 2 8 0
2016-12-28 00:03:01 4 2 2 8 0
2016-12-28 00:04:01 4 2 2 8 0
2016-12-28 00:05:01 4 2 2 8 0
2016-12-28 00:06:01 4 2 2 8 0
2016-12-28 00:07:01 4 2 1 7 1
2016-12-28 00:08:01 4 2 2 7 0
2016-12-28 00:09:01 4 2 1 7 1
2016-12-28 00:10:01 4 2 2 7 0
2016-12-28 00:11:01 3 2 1 7 1
2016-12-28 00:12:01 3 2 1 7 1
2016-12-28 00:13:01 3 2 1 7 1
2016-12-28 00:14:01 3 2 1 7 1
2016-12-28 00:15:01 3 2 1 7 1
2016-12-28 00:16:01 3 2 2 7 0
2016-12-28 00:17:01 3 2 2 7 0
2016-12-28 00:18:01 3 2 2 7 0
2016-12-28 00:19:01 3 2 1 7 1
2016-12-28 00:20:01 3 2 2 7 0
2016-12-28 00:21:01 3 2 1 7 1
2016-12-28 00:22:01 3 2 1 7 1
2016-12-28 00:23:01 3 2 1 7 1
2016-12-28 00:24:01 3 2 1 7 1
2016-12-28 00:25:01 3 2 1 7 1
2016-12-28 00:26:01 3 2 1 7 1
2016-12-28 00:27:01 3 2 1 7 1
2016-12-28 00:28:01 3 2 1 7 1
2016-12-28 00:29:01 3 2 1 7 1
2016-12-28 00:30:01 3 2 1 7 1
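To make the behaviour above easier to check, here is a minimal sketch on a made-up four-row frame (the values and the column names a and b are illustrative, not the asker's file). It assumes that asfreq forwards method='nearest' to reindex, which is how it behaves in the pandas versions I am aware of even though the asfreq documentation only lists the pad and backfill methods.

```python
import pandas as pd

# Hypothetical miniature of the question's data: roughly one-minute samples
# whose seconds drift, with the 00:03 and 00:04 readings missing.
times = pd.to_datetime([
    "2016-12-28 00:00:01",
    "2016-12-28 00:01:01",
    "2016-12-28 00:02:03",
    "2016-12-28 00:05:02",
])
df = pd.DataFrame({"a": [4, 4, 4, 3], "b": [2, 2, 1, 1]}, index=times)
df.index.name = "time"

# asfreq builds a one-minute grid from the first to the last timestamp and
# takes, for every grid point, the row with the closest original timestamp.
by_asfreq = df.asfreq("min", method="nearest")

# The same grid built by hand and passed to reindex (assumed equivalent).
tidx = pd.date_range(df.index.min(), df.index.max(), freq="min")
by_reindex = df.reindex(tidx, method="nearest")

print(by_asfreq)
```

Note that without a tolerance every grid point receives a value, so the genuinely missing minutes (00:03 and 00:04 here) are filled from whichever neighbour is closer rather than left empty.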
Solution 2
Use reindex with 'nearest' and tolerance
df.time = pd.to_datetime(df.time)
df = df.set_index('time')
tidx = pd.date_range(df.index.min(), df.index.max(), freq='min')
df.reindex(tidx, method='nearest', tolerance='5min')
a b c d e
2016-12-28 00:00:01 4 2 2 8 0
2016-12-28 00:01:01 4 2 2 8 0
2016-12-28 00:02:01 4 2 2 8 0
2016-12-28 00:03:01 4 2 2 8 0
2016-12-28 00:04:01 4 2 2 8 0
2016-12-28 00:05:01 4 2 2 8 0
2016-12-28 00:06:01 4 2 2 8 0
2016-12-28 00:07:01 4 2 1 7 1
2016-12-28 00:08:01 4 2 2 7 0
2016-12-28 00:09:01 4 2 1 7 1
2016-12-28 00:10:01 4 2 2 7 0
2016-12-28 00:11:01 3 2 1 7 1
2016-12-28 00:12:01 3 2 1 7 1
2016-12-28 00:13:01 3 2 1 7 1
2016-12-28 00:14:01 3 2 1 7 1
2016-12-28 00:15:01 3 2 1 7 1
2016-12-28 00:16:01 3 2 2 7 0
2016-12-28 00:17:01 3 2 2 7 0
2016-12-28 00:18:01 3 2 2 7 0
2016-12-28 00:19:01 3 2 1 7 1
2016-12-28 00:20:01 3 2 2 7 0
2016-12-28 00:21:01 3 2 1 7 1
2016-12-28 00:22:01 3 2 1 7 1
2016-12-28 00:23:01 3 2 1 7 1
2016-12-28 00:24:01 3 2 1 7 1
2016-12-28 00:25:01 3 2 1 7 1
2016-12-28 00:26:01 3 2 1 7 1
2016-12-28 00:27:01 3 2 1 7 1
2016-12-28 00:28:01 3 2 1 7 1
2016-12-28 00:29:01 3 2 1 7 1
2016-12-28 00:30:01 3 2 1 7 1
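With tolerance='5min' every gap in the sample data is narrow enough to be filled, so the result matches Solution 1. The following is a small sketch, on made-up data, of how a tighter tolerance changes that: grid points with no original row close enough are left as NaN. The 30-second value is an assumption chosen for illustration, not something taken from the answer.

```python
import pandas as pd

# Hypothetical miniature of the question's data: roughly one-minute samples
# whose seconds drift, with the 00:03 and 00:04 readings missing.
times = pd.to_datetime([
    "2016-12-28 00:00:01",
    "2016-12-28 00:01:01",
    "2016-12-28 00:02:03",
    "2016-12-28 00:05:02",
])
df = pd.DataFrame({"a": [4, 4, 4, 3], "b": [2, 2, 1, 1]}, index=times)

tidx = pd.date_range(df.index.min(), df.index.max(), freq="min")

# Loose tolerance: every grid point has a row within 5 minutes, so none is NaN.
loose = df.reindex(tidx, method="nearest", tolerance=pd.Timedelta("5min"))

# Tight tolerance: only rows within 30 seconds of a grid point are used, so
# drifting seconds are absorbed while truly missing minutes stay NaN.
tight = df.reindex(tidx, method="nearest", tolerance=pd.Timedelta("30s"))

print(tight)
```

Choosing the tolerance is a trade-off: it needs to exceed the largest drift in the seconds you expect, yet stay below the sampling interval if missing rows should remain visible as NaN.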