I'm working with financial data that is recorded at irregular intervals. Some of the timestamps are duplicates, which makes analysis tricky. Here is a sample of the data; note that there are four 2016-08-23 00:00:17.664193 timestamps:
In [167]: ts
Out[167]:
                               last  last_sz      bid      ask
datetime
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00
In this example there are only a few duplicates, but in some cases hundreds of consecutive rows all share the same timestamp. My goal is to fix this by adding 1 extra nanosecond to each duplicate (so in a run of 4 identical timestamps, I would add 1ns to the second, 2ns to the third, and 3ns to the fourth). For example, the data above would be transformed into:
In [169]: make_timestamps_unique(ts)
Out[169]:
                                  last  last_sz      bid     ask
newindex
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0
I have been struggling to find a good way to do this. My current solution is to make multiple passes, each time checking for duplicates and adding 1ns to every timestamp in a run of identical ones except the first. Here is the code:
import numpy as np
import pandas as pd

def make_timestamps_unique(ts):
    mask = ts.index.duplicated(keep='first')
    duplicate_count = np.sum(mask)
    passes = 0
    while duplicate_count > 0:
        # shift every duplicate (except the first of each run) forward by 1ns
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated(keep='first')
        duplicate_count = np.sum(mask)
        passes += 1
    print('%d passes of duplication loop' % passes)
    return ts
This is obviously inefficient: it typically takes hundreds of passes, and when I try it on a 2-million-row DataFrame I get a MemoryError. Any ideas for a better way to accomplish this?
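For reference, here is a minimal sketch that reconstructs the sample frame above (the values are transcribed by hand from the printout, so treat it only as test scaffolding):

import pandas as pd

idx = pd.DatetimeIndex([
    '2016-08-23 00:00:14.161128', '2016-08-23 00:00:14.901180',
    '2016-08-23 00:00:17.196639', '2016-08-23 00:00:17.664193',
    '2016-08-23 00:00:17.664193', '2016-08-23 00:00:17.664193',
    '2016-08-23 00:00:17.664193', '2016-08-23 00:00:26.206108',
    '2016-08-23 00:00:28.322456', '2016-08-23 00:00:28.322456',
], name='datetime')

ts = pd.DataFrame({
    'last':    [2170.75, 2171.00, 2170.75, 2171.00, 2171.00,
                2171.00, 2171.00, 2170.75, 2170.75, 2170.75],
    'last_sz': [1, 1, 1, 1, 1, 2, 1, 2, 7, 1],
    'bid':     2170.75,   # constant in this sample
    'ask':     2171.00,   # constant in this sample
}, index=idx)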
Answer 0 (score: 6)
Here is a faster numpy version (but a little less readable), inspired by this SO article. The idea is to use cumsum on the duplicated timestamp values, while resetting the cumulative sum each time a np.NaN is encountered:
# get duplicated values as float and replace 0 with NaN
values = df.index.duplicated(keep=False).astype(float)
values[values == 0] = np.NaN

# running count of duplicates; at each NaN the accumulated total
# is subtracted again, which resets the running sum to zero
missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# add the per-run offsets (interpreted as nanoseconds) to the index
result = df.index + np.cumsum(values).astype(np.timedelta64)
print(result)
DatetimeIndex([   '2016-08-23 00:00:14.161128',
                  '2016-08-23 00:00:14.901180',
                  '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
                  '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)
Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the groupby/apply approach from @DYZ (which is more readable) is roughly 50 times slower on the dummy data, at 100 loops, best of 3: 5.3 ms per loop.
Of course, you have to reset the index at the end:
df.index = result
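To see how the reset-cumsum trick works, here is a small standalone walk-through on a toy duplicate mask (an illustration added for clarity, not part of the original answer):

import numpy as np

# True marks rows flagged by index.duplicated(keep=False)
dup = np.array([False, True, True, True, False, True, True])

values = dup.astype(float)
values[values == 0] = np.nan   # [nan, 1, 1, 1, nan, 1, 1]
missings = np.isnan(values)

cumsum = np.cumsum(~missings)  # [0, 1, 2, 3, 3, 4, 5]
# at each NaN, subtract everything accumulated so far,
# which resets the running sum to zero
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff       # values is now [-0, 1, 1, 1, -3, 1, 1]
print(np.cumsum(values))       # [-0.  1.  2.  3.  0.  1.  2.]

Each run of duplicates gets offsets 1, 2, 3, ... and the counter resets at every non-duplicate row. Note that because the answer uses duplicated(keep=False), the first member of each run is shifted as well (hence ...001 through ...004 in the output above); swapping in keep='first' appears to reproduce the exact output the question asks for, with the first occurrence left untouched, though that is worth verifying on your own data.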
Answer 1 (score: 5)
You can group the rows by the index and then add a sequence of consecutive timedeltas to the index of each group. I am not sure whether this can be done directly on the index, but you can first convert the index to an ordinary column, apply the operation to the column, and then set the column as the index again:
newindex = ts.reset_index()\
             .groupby('datetime')['datetime']\
             .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
ts.index = newindex
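A related variant along the same lines (a sketch, not from the original answer) avoids apply entirely: groupby(level=0).cumcount() numbers the rows within each identical-timestamp group 0, 1, 2, ..., which is exactly the nanosecond offset the question asks for:

import pandas as pd

# 0 for the first occurrence of each timestamp, 1, 2, ... for its duplicates
offsets = ts.groupby(level=0).cumcount().values
ts.index = ts.index + pd.to_timedelta(offsets, unit='ns')

Because the first occurrence gets offset 0, this matches the desired output in the question, where only the second and later duplicates are shifted.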
Answer 2 (score: 1)
Let's start with vectorized operations, benchmarked with %timeit; since you are dealing with 1M+ rows, this should be a priority.
Let's create some test data, since none was provided:
rng = pd.date_range('1/1/2011', periods=72, freq='H')
df = pd.DataFrame(dict(time = rng))
Duplicate the timestamps:
df = pd.concat((df, df))
df = df.sort_index()
df
Out [296]:
                 time
0 2011-01-01 00:00:00
0 2011-01-01 00:00:00
1 2011-01-01 01:00:00
1 2011-01-01 01:00:00
2 2011-01-01 02:00:00
2 2011-01-01 02:00:00
3 2011-01-01 03:00:00
3 2011-01-01 03:00:00
4 2011-01-01 04:00:00
4 2011-01-01 04:00:00
5 2011-01-01 05:00:00
5 2011-01-01 05:00:00
6 2011-01-01 06:00:00
6 2011-01-01 06:00:00
7 2011-01-01 07:00:00
7 2011-01-01 07:00:00
8 2011-01-01 08:00:00
8 2011-01-01 08:00:00
9 2011-01-01 09:00:00
9 2011-01-01 09:00:00
Find the positions where the time difference from the previous row is 0 seconds:
mask = (df.time - df.time.shift()) == np.timedelta64(0, 's')
mask
Out [307]:
0 False
0 True
1 False
1 True
2 False
2 True
3 False
3 True
4 False
4 True
5 False
Offset those positions; in this case I chose 5 milliseconds:
df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5))
df
Out [309]:
                     time
0 2011-01-01 00:00:00.000
0 2011-01-01 00:00:00.005
1 2011-01-01 01:00:00.000
1 2011-01-01 01:00:00.005
2 2011-01-01 02:00:00.000
2 2011-01-01 02:00:00.005
3 2011-01-01 03:00:00.000
3 2011-01-01 03:00:00.005
4 2011-01-01 04:00:00.000
4 2011-01-01 04:00:00.005
5 2011-01-01 05:00:00.000
EDIT: Handling runs of consecutive timestamps (assuming up to 4):
consect = 4
for i in range(4):
    # compare each row with the row `consect` positions back and
    # nudge the matches by a growing millisecond offset
    mask = (df.time - df.time.shift(consect)) == np.timedelta64(0, 's')
    df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5 + i))
    consect -= 1
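As a side note (not part of the original answer), the .apply calls above can be replaced by a plain vectorized addition, which is usually much faster on large frames. A minimal self-contained sketch of the same offsetting step:

import pandas as pd

rng = pd.date_range('1/1/2011', periods=3, freq='H')
df = pd.concat([pd.DataFrame({'time': rng})] * 2).sort_index()

mask = (df.time - df.time.shift()) == pd.Timedelta(0)
# vectorized equivalent of df.time[mask].apply(lambda x: x + pd.offsets.Milli(5))
df.loc[mask, 'time'] = df.loc[mask, 'time'] + pd.Timedelta(milliseconds=5)
print(df)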