Reindex a DataFrame in Pandas with tolerance between the new and old index

Date: 2017-02-12 13:29:48

Tags: python pandas

This is a bit of a follow-up to a previous question (Patch over missing rows in CSV file in Python).

The answer given there works well when the timestamps increase in regular steps, but some of the new data looks more like this:

time,a,b,c,d,e
"2016-12-28 00:00:01","4","2","2","8","0"
"2016-12-28 00:01:01","4","2","2","8","0"
"2016-12-28 00:02:01","4","2","2","8","0"
"2016-12-28 00:03:01","4","2","2","8","0"
"2016-12-28 00:05:01","4","2","2","8","0"
"2016-12-28 00:06:01","4","2","2","8","0"
"2016-12-28 00:07:01","4","2","1","7","1"
"2016-12-28 00:08:01","4","2","2","7","0"
"2016-12-28 00:09:02","4","2","1","7","1"
"2016-12-28 00:10:02","4","2","2","7","0"
"2016-12-28 00:11:02","3","2","1","7","1"
"2016-12-28 00:12:05","3","2","1","7","1"
"2016-12-28 00:13:02","3","2","1","7","1"
"2016-12-28 00:18:02","3","2","2","7","0"
"2016-12-28 00:19:05","3","2","1","7","1"
"2016-12-28 00:20:02","3","2","2","7","0"
"2016-12-28 00:21:02","3","2","1","7","1"
"2016-12-28 00:22:03","3","2","1","7","1"
"2016-12-28 00:23:02","3","2","1","7","1"
"2016-12-28 00:24:02","3","2","1","7","1"
"2016-12-28 00:25:05","3","2","1","7","1"
"2016-12-28 00:26:02","3","2","1","7","1"
"2016-12-28 00:27:03","3","2","1","7","1"
"2016-12-28 00:28:01","3","2","1","7","1"
"2016-12-28 00:29:02","3","2","1","7","1"
"2016-12-28 00:30:06","3","2","1","7","1"

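Using the previously given solution produces the result below: wherever the new index does not match an existing timestamp exactly, the existing data is wiped out.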

import pandas as pd

# Load the data and drop rows with a duplicated timestamp
z = pd.read_csv('smalldata.csv')
z = z[~z.time.duplicated()]
z['time'] = pd.to_datetime(z['time'])
# Reindex onto a one-minute grid; only exact timestamp matches keep their data
x = z.set_index('time').reindex(pd.date_range(min(z['time']), max(z['time']), freq="1min"))

print(x)

Output

2016-12-28 00:00:01  4.0  2.0  2.0  8.0  0.0
2016-12-28 00:01:01  4.0  2.0  2.0  8.0  0.0
2016-12-28 00:02:01  4.0  2.0  2.0  8.0  0.0
2016-12-28 00:03:01  4.0  2.0  2.0  8.0  0.0
2016-12-28 00:04:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:05:01  4.0  2.0  2.0  8.0  0.0
2016-12-28 00:06:01  4.0  2.0  2.0  8.0  0.0
2016-12-28 00:07:01  4.0  2.0  1.0  7.0  1.0
2016-12-28 00:08:01  4.0  2.0  2.0  7.0  0.0
2016-12-28 00:09:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:10:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:11:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:12:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:13:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:14:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:15:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:16:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:17:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:18:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:19:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:20:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:21:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:22:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:23:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:24:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:25:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:26:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:27:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:28:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:29:01  NaN  NaN  NaN  NaN  NaN
2016-12-28 00:30:01  NaN  NaN  NaN  NaN  NaN
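
The NaN rows appear because a plain reindex only keeps rows whose labels match the new index exactly. A minimal sketch of that behaviour, using three hypothetical samples taken from around the point where the timestamps drift from :01 to :02 seconds (the Series and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical samples: the seconds part drifts from :01 to :02
idx = pd.to_datetime(['2016-12-28 00:08:01',
                      '2016-12-28 00:09:02',
                      '2016-12-28 00:10:02'])
s = pd.Series([1, 2, 3], index=idx)

# The one-minute grid is anchored on the first timestamp, so every label ends in :01
grid = pd.date_range(idx.min(), idx.max(), freq='min')

# Without a fill method, reindex matches labels exactly;
# 00:09:01 and 00:10:01 do not exist in s, so they come back as NaN
exact = s.reindex(grid)
print(exact)
```

Once the seconds drift, no later grid label matches a real timestamp, which would explain why everything after 00:08:01 is lost in the output above.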

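I can ffill() this, but then I just copy the last valid value down forever, rather than filling in only the missing minutes, which is what I really want.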

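I have tried setting the reindex freq to a range that covers the seconds part of the timestamp, without success. I also looked at the tolerance option, but could not find any documentation on it, and my experiments with it did not prove useful.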

Is there a way to reindex a DataFrame when the index does not increase in even steps?

1 answer:

Answer 0: (score: 1)

Solution 1
Use asfreq

df.time = pd.to_datetime(df.time)
df = df.set_index('time')
df.asfreq('min', method='nearest')

                     a  b  c  d  e
time                              
2016-12-28 00:00:01  4  2  2  8  0
2016-12-28 00:01:01  4  2  2  8  0
2016-12-28 00:02:01  4  2  2  8  0
2016-12-28 00:03:01  4  2  2  8  0
2016-12-28 00:04:01  4  2  2  8  0
2016-12-28 00:05:01  4  2  2  8  0
2016-12-28 00:06:01  4  2  2  8  0
2016-12-28 00:07:01  4  2  1  7  1
2016-12-28 00:08:01  4  2  2  7  0
2016-12-28 00:09:01  4  2  1  7  1
2016-12-28 00:10:01  4  2  2  7  0
2016-12-28 00:11:01  3  2  1  7  1
2016-12-28 00:12:01  3  2  1  7  1
2016-12-28 00:13:01  3  2  1  7  1
2016-12-28 00:14:01  3  2  1  7  1
2016-12-28 00:15:01  3  2  1  7  1
2016-12-28 00:16:01  3  2  2  7  0
2016-12-28 00:17:01  3  2  2  7  0
2016-12-28 00:18:01  3  2  2  7  0
2016-12-28 00:19:01  3  2  1  7  1
2016-12-28 00:20:01  3  2  2  7  0
2016-12-28 00:21:01  3  2  1  7  1
2016-12-28 00:22:01  3  2  1  7  1
2016-12-28 00:23:01  3  2  1  7  1
2016-12-28 00:24:01  3  2  1  7  1
2016-12-28 00:25:01  3  2  1  7  1
2016-12-28 00:26:01  3  2  1  7  1
2016-12-28 00:27:01  3  2  1  7  1
2016-12-28 00:28:01  3  2  1  7  1
2016-12-28 00:29:01  3  2  1  7  1
2016-12-28 00:30:01  3  2  1  7  1
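
A small self-contained sketch of the same idea may help when trying this without the CSV file. The four samples below are hypothetical stand-ins for the question's data. Note that the asfreq documentation lists only pad and backfill for method; 'nearest' is assumed to work because asfreq appears to pass method through to reindex, as the output above suggests.

```python
import pandas as pd

# Hypothetical, irregularly spaced samples (00:02 is missing, 00:04 is one second late)
idx = pd.to_datetime(['2016-12-28 00:00:01',
                      '2016-12-28 00:01:01',
                      '2016-12-28 00:03:01',
                      '2016-12-28 00:04:02'])
df = pd.DataFrame({'a': [4, 4, 4, 3]}, index=idx)

# asfreq builds a one-minute grid from the first to the last timestamp
# and fills each grid point from the nearest original row
regular = df.asfreq('min', method='nearest')
print(regular)
```

The grid starts at the first timestamp, so every new label ends in :01, and the row recorded at 00:04:02 ends up under 00:04:01.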

Solution 2
Use reindex together with method='nearest' and tolerance

df.time = pd.to_datetime(df.time)
df = df.set_index('time')
tidx = pd.date_range(df.index.min(), df.index.max(), freq='min')
df.reindex(tidx, method='nearest', tolerance='5min')

                     a  b  c  d  e
2016-12-28 00:00:01  4  2  2  8  0
2016-12-28 00:01:01  4  2  2  8  0
2016-12-28 00:02:01  4  2  2  8  0
2016-12-28 00:03:01  4  2  2  8  0
2016-12-28 00:04:01  4  2  2  8  0
2016-12-28 00:05:01  4  2  2  8  0
2016-12-28 00:06:01  4  2  2  8  0
2016-12-28 00:07:01  4  2  1  7  1
2016-12-28 00:08:01  4  2  2  7  0
2016-12-28 00:09:01  4  2  1  7  1
2016-12-28 00:10:01  4  2  2  7  0
2016-12-28 00:11:01  3  2  1  7  1
2016-12-28 00:12:01  3  2  1  7  1
2016-12-28 00:13:01  3  2  1  7  1
2016-12-28 00:14:01  3  2  1  7  1
2016-12-28 00:15:01  3  2  1  7  1
2016-12-28 00:16:01  3  2  2  7  0
2016-12-28 00:17:01  3  2  2  7  0
2016-12-28 00:18:01  3  2  2  7  0
2016-12-28 00:19:01  3  2  1  7  1
2016-12-28 00:20:01  3  2  2  7  0
2016-12-28 00:21:01  3  2  1  7  1
2016-12-28 00:22:01  3  2  1  7  1
2016-12-28 00:23:01  3  2  1  7  1
2016-12-28 00:24:01  3  2  1  7  1
2016-12-28 00:25:01  3  2  1  7  1
2016-12-28 00:26:01  3  2  1  7  1
2016-12-28 00:27:01  3  2  1  7  1
2016-12-28 00:28:01  3  2  1  7  1
2016-12-28 00:29:01  3  2  1  7  1
2016-12-28 00:30:01  3  2  1  7  1
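
With tolerance='5min' every gap in this data is close enough to a real sample to be filled. As a rough sketch of how the tolerance changes the result, the example below compares a loose and a tight tolerance on a hypothetical excerpt of the data (the inline CSV, the single column and the 10-second value are assumptions chosen for illustration):

```python
import io
import pandas as pd

# Hypothetical excerpt, inlined so the sketch runs without smalldata.csv
csv = io.StringIO("""time,a
2016-12-28 00:00:01,4
2016-12-28 00:01:01,4
2016-12-28 00:03:01,4
2016-12-28 00:04:02,3
2016-12-28 00:05:05,3
""")
df = pd.read_csv(csv, parse_dates=['time']).set_index('time')

tidx = pd.date_range(df.index.min(), df.index.max(), freq='min')

# Loose tolerance: every grid point finds a neighbour within 5 minutes
loose = df.reindex(tidx, method='nearest', tolerance=pd.Timedelta('5min'))

# Tight tolerance: rows that are a few seconds off still match,
# but the missing 00:02 minute stays NaN because its nearest neighbours are 60 s away
tight = df.reindex(tidx, method='nearest', tolerance=pd.Timedelta('10s'))

print(loose)
print(tight)
```

A tight tolerance keeps the slightly late rows while leaving truly missing minutes as NaN, which can then be handled separately, for example with a limited fill.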