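This is related to an earlier question, "Python pandas change duplicate timestamp to unique", hence the similar title.
The additional requirement is to handle multiple duplicates per second and to space them evenly between the second boundaries, i.e.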
....
2011/1/4 9:14:00
2011/1/4 9:14:00
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:14:01
2011/1/4 9:15:02
2011/1/4 9:15:02
2011/1/4 9:15:02
2011/1/4 9:15:03
....
should become
....
2011/1/4 9:14:00
2011/1/4 9:14:00.500
2011/1/4 9:14:01
2011/1/4 9:14:01.200
2011/1/4 9:14:01.400
2011/1/4 9:14:01.600
2011/1/4 9:14:01.800
2011/1/4 9:15:02
2011/1/4 9:15:02.333
2011/1/4 9:15:02.666
2011/1/4 9:15:03
....
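To make the target rule explicit: in a run of n identical timestamps, the k-th entry (counting from 0) is shifted by k/n of a second. Below is a minimal pure-Python sketch of that rule. It assumes the input is already sorted so that duplicates are adjacent, and `spread_duplicates` is a hypothetical helper name, not part of pandas.

```python
from datetime import datetime, timedelta
from itertools import groupby


def spread_duplicates(stamps):
    """Spread each run of equal timestamps evenly over the following second.

    Assumes `stamps` is sorted, so equal values sit next to each other.
    """
    result = []
    for stamp, run in groupby(stamps):
        n = len(list(run))  # number of duplicates in this run
        # the k-th duplicate (0-based) is shifted by k/n of a second
        result.extend(stamp + timedelta(seconds=1) * k / n for k in range(n))
    return result


base = datetime(2011, 1, 4, 9, 14, 0)
stamps = [base, base] + [base + timedelta(seconds=1)] * 5
for t in spread_duplicates(stamps):
    print(t)
```

A timestamp that occurs only once has n = 1 and k = 0, so it is left unchanged.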
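I am having trouble working out how to handle a variable number of duplicates.
I have been thinking along the lines of groupby(), but cannot get it right. I assume this is a common use case that has already been solved, so any help is much appreciated.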
Answer 0 (score: 1)
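I converted the datetime column to timedelta[ms]. The problem is that the resulting numbers are too large, so I first subtracted the years since the epoch (2011 - 1970) from each timestamp.
Then I computed the per-row offsets and added them to the base time: df['one'] = df['one'] - df['new'] + df['timedelta'].
Finally, the integer milliseconds are converted back to timedeltas and the years (2011 - 1970) are added back.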
import numpy as np
import pandas as pd

# df holds the sample data below in a single 'time' column:
# time
#0 2011-01-04 09:14:00
#1 2011-01-04 09:14:00
#2 2011-01-04 09:14:01
#3 2011-01-04 09:14:01
#4 2011-01-04 09:14:01
#5 2011-01-04 09:14:01
#6 2011-01-04 09:14:01
#7 2011-01-04 09:15:02
#8 2011-01-04 09:15:02
#9 2011-01-04 09:15:02
#10 2011-01-04 09:15:03
# dtype of the time column: datetime64[ns]
# subtract the years since the epoch so the timedeltas stay small
df['time1'] = df['time'].apply(lambda x: x - pd.DateOffset(years=2011-1970))
# convert time to timedeltas in milliseconds
df['timedelta'] = pd.to_timedelta(df['time1']) / np.timedelta64(1, 'ms')
df['one'] = 1
# each row gets 1/n, where n is its group size (mean / sum of a column of ones)
m = lambda x: (x.mean()) / x.sum()
df['one'] = df.groupby('time')['one'].transform(m)
# step between duplicates in whole milliseconds (truncated to integer)
df['new'] = (df['one']*1000).astype(int)
# cumulative sum of the steps within each group
df['one'] = df.groupby('time')['new'].transform(np.cumsum)
# subtract one step so each group starts at offset 0, then add the base time in ms
df['one'] = df['one'] - df['new'] + df['timedelta']
# convert the integer milliseconds to timedelta
df['final'] = pd.to_timedelta(df['one'],unit='ms')
# add back the removed years
df['final'] = df['final'].apply(lambda x: pd.to_datetime(x) + pd.DateOffset(years=2011-1970))
# remove the helper columns
df = df.drop(['time1', 'timedelta', 'one', 'new'], axis=1)
print(df)
# time final
#0 2011-01-04 09:14:00 2011-01-04 09:14:00.000
#1 2011-01-04 09:14:00 2011-01-04 09:14:00.500
#2 2011-01-04 09:14:01 2011-01-04 09:14:01.000
#3 2011-01-04 09:14:01 2011-01-04 09:14:01.200
#4 2011-01-04 09:14:01 2011-01-04 09:14:01.400
#5 2011-01-04 09:14:01 2011-01-04 09:14:01.600
#6 2011-01-04 09:14:01 2011-01-04 09:14:01.800
#7 2011-01-04 09:15:02 2011-01-04 09:15:02.000
#8 2011-01-04 09:15:02 2011-01-04 09:15:02.333
#9 2011-01-04 09:15:02 2011-01-04 09:15:02.666
#10 2011-01-04 09:15:03 2011-01-04 09:15:03.000
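
For comparison, the same spacing can probably be reached without the epoch round trip. The following is only a sketch of an alternative, not the method used above. It assumes the frame has a single datetime64 column named 'time', as in the sample data, and the names `position`, `size` and `offset_ms` are introduced here purely for illustration. It combines groupby().cumcount() for the position inside each group with value_counts() for the group size.

```python
import pandas as pd

# sample data from the question, in a column named 'time'
df = pd.DataFrame({"time": pd.to_datetime([
    "2011-01-04 09:14:00", "2011-01-04 09:14:00",
    "2011-01-04 09:14:01", "2011-01-04 09:14:01", "2011-01-04 09:14:01",
    "2011-01-04 09:14:01", "2011-01-04 09:14:01",
    "2011-01-04 09:15:02", "2011-01-04 09:15:02", "2011-01-04 09:15:02",
    "2011-01-04 09:15:03",
])})

# 0, 1, 2, ... within each group of identical timestamps
position = df.groupby("time").cumcount()
# number of rows sharing each timestamp, aligned to the rows
size = df["time"].map(df["time"].value_counts())
# offset in whole milliseconds; integer division truncates (e.g. 2000 // 3 == 666)
offset_ms = (position * 1000) // size

df["final"] = df["time"] + pd.to_timedelta(offset_ms, unit="ms")
print(df)
```

Because the offset is computed as position * 1000 // size rather than as a truncated step multiplied by the position, the two approaches agree on the sample data but may differ by a millisecond or so for group sizes that do not divide 1000 evenly.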