Python pandas: change a variable number of duplicate timestamps to unique

Time: 2015-11-04 17:35:00

Tags: python pandas duplicates time-series

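This is related to an earlier question, "Python pandas change duplicate timestamp to unique", hence the similar title.

The additional requirement is to handle multiple duplicates per second and distribute them evenly between the second boundaries, i.e.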

....
2011/1/4    9:14:00
2011/1/4    9:14:00
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:14:01
2011/1/4    9:15:02
2011/1/4    9:15:02
2011/1/4    9:15:02
2011/1/4    9:15:03
....

should become

....
2011/1/4    9:14:00
2011/1/4    9:14:00.500
2011/1/4    9:14:01
2011/1/4    9:14:01.200
2011/1/4    9:14:01.400
2011/1/4    9:14:01.600
2011/1/4    9:14:01.800
2011/1/4    9:15:02
2011/1/4    9:15:02.333
2011/1/4    9:15:02.666
2011/1/4    9:15:03
....

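I am having a hard time working out how to handle a variable number of duplicates.

I have been thinking along the lines of groupby(), but cannot get it right. I assume this is a common use case that has already been solved, so any help is much appreciated.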

1 answer:

Answer 0: (score: 1)

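I converted the datetime column to a timedelta in milliseconds. The problem was that the numbers were too large, so I first shifted the dates back to the epoch year by subtracting 2011 - 1970 years. Then I computed the per-row differences within each group of duplicates and added them to the first column: df['one'] = df['one'] - df['new'] + df['timedelta']. The resulting millisecond values are converted back to timedeltas, and finally the 2011 - 1970 years are added back.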
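The arithmetic behind these steps can be sketched without pandas. This is only an illustration: the helper name duplicate_offsets_ms is hypothetical, and it assumes that truncating the step to whole milliseconds is acceptable, as the integer cast in the answer implies.

```python
def duplicate_offsets_ms(n):
    """Return the millisecond offsets for n rows that share one timestamp."""
    # Whole milliseconds between neighbouring duplicates (truncated).
    step = 1000 // n
    # The first row keeps the original second; the k-th row is shifted by k steps.
    return [k * step for k in range(n)]


print(duplicate_offsets_ms(2))  # [0, 500]
print(duplicate_offsets_ms(5))  # [0, 200, 400, 600, 800]
print(duplicate_offsets_ms(3))  # [0, 333, 666]
```

Because the step is truncated, the last duplicate always lands before the next second boundary, so the shifted values cannot collide with a row in the following second.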

import numpy as np
import pandas as pd

#sample data from the question
df = pd.DataFrame({'time': pd.to_datetime(
    ['2011-01-04 09:14:00'] * 2 +
    ['2011-01-04 09:14:01'] * 5 +
    ['2011-01-04 09:15:02'] * 3 +
    ['2011-01-04 09:15:03'])})

#                 time
#0  2011-01-04 09:14:00
#1  2011-01-04 09:14:00
#2  2011-01-04 09:14:01
#3  2011-01-04 09:14:01
#4  2011-01-04 09:14:01
#5  2011-01-04 09:14:01
#6  2011-01-04 09:14:01
#7  2011-01-04 09:15:02
#8  2011-01-04 09:15:02
#9  2011-01-04 09:15:02
#10 2011-01-04 09:15:03
#time    datetime64[ns]

#remove years to get smaller timedeltas
df['time1'] = df['time'].apply(lambda x: x - pd.DateOffset(years=2011-1970))
#convert time to milliseconds since the epoch
df['timedelta'] = (df['time1'] - pd.Timestamp(0)) / np.timedelta64(1, 'ms')
df['one'] = 1
#fraction of a second per duplicate: mean/sum of ones in a group is 1/n
m = lambda x: (x.mean()) / x.sum()
df['one'] = df.groupby('time')['one'].transform(m)
#cast float to integer milliseconds
df['new'] = (df['one']*1000).astype(int)
#cumulative sum of the steps within each group
df['one'] = df.groupby('time')['new'].cumsum()
#subtract one step so each group starts at 0, then add the base time
df['one'] = df['one'] - df['new'] + df['timedelta']
#convert milliseconds to timedelta
df['final'] = pd.to_timedelta(df['one'],unit='ms')
#convert back to datetime and add the removed years
df['final'] = df['final'].apply(lambda x: pd.Timestamp(0) + x + pd.DateOffset(years=2011-1970))
#remove unnecessary columns
df = df.drop(['time1', 'timedelta', 'one', 'new'], axis=1)
print(df)
#                  time                   final
#0  2011-01-04 09:14:00 2011-01-04 09:14:00.000
#1  2011-01-04 09:14:00 2011-01-04 09:14:00.500
#2  2011-01-04 09:14:01 2011-01-04 09:14:01.000
#3  2011-01-04 09:14:01 2011-01-04 09:14:01.200
#4  2011-01-04 09:14:01 2011-01-04 09:14:01.400
#5  2011-01-04 09:14:01 2011-01-04 09:14:01.600
#6  2011-01-04 09:14:01 2011-01-04 09:14:01.800
#7  2011-01-04 09:15:02 2011-01-04 09:15:02.000
#8  2011-01-04 09:15:02 2011-01-04 09:15:02.333
#9  2011-01-04 09:15:02 2011-01-04 09:15:02.666
#10 2011-01-04 09:15:03 2011-01-04 09:15:03.000
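
The same result can probably be reached more directly by adding the offsets to the original timestamps, which avoids the detour through epoch milliseconds. The following is a sketch of that alternative, not the method used above; the names size, position and step_ms are illustrative, and it assumes that truncating the step to whole milliseconds is acceptable.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({'time': pd.to_datetime(
    ['2011-01-04 09:14:00'] * 2 +
    ['2011-01-04 09:14:01'] * 5 +
    ['2011-01-04 09:15:02'] * 3 +
    ['2011-01-04 09:15:03'])})

# Number of rows that share each timestamp.
size = df.groupby('time')['time'].transform('size')
# Position of each row inside its group of duplicates: 0, 1, 2, ...
position = df.groupby('time').cumcount()
# Whole milliseconds between neighbouring duplicates (truncated).
step_ms = 1000 // size

# Shift every row by position * step within its second.
df['final'] = df['time'] + pd.to_timedelta(position * step_ms, unit='ms')
print(df)
```

cumcount() numbers rows in order of appearance, so the frame does not need to be sorted first, although the shifted values are only in chronological order if the input already is.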