Modifying timestamps in pandas to make the index unique

Date: 2017-04-08 16:54:41

Tags: python pandas

I'm working with financial data that is recorded at irregular intervals. Some of the timestamps are duplicates, which is making analysis tricky. Here is a sample of the data - note that there are four 2016-08-23 00:00:17.664193 timestamps:

In [167]: ts
Out[168]: 
                               last  last_sz      bid      ask
datetime                                                      
2016-08-23 00:00:14.161128  2170.75        1  2170.75  2171.00
2016-08-23 00:00:14.901180  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.196639  2170.75        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        2  2170.75  2171.00
2016-08-23 00:00:17.664193  2171.00        1  2170.75  2171.00
2016-08-23 00:00:26.206108  2170.75        2  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        7  2170.75  2171.00
2016-08-23 00:00:28.322456  2170.75        1  2170.75  2171.00

In this example there are only a few duplicates, but in some cases there are hundreds of consecutive rows that all share the same timestamp. My goal is to fix this by adding 1 extra nanosecond to each duplicate (so in a run of 4 identical timestamps, I would add 1ns to the second, 2ns to the third, and 3ns to the fourth). For example, the data above would be transformed into:

In [169]: make_timestamps_unique(ts)
Out[170]:
                                  last  last_sz      bid     ask
newindex                                                        
2016-08-23 00:00:14.161128000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:14.901180000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.196639000  2170.75        1  2170.75  2171.0
2016-08-23 00:00:17.664193000  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193001  2171.00        1  2170.75  2171.0
2016-08-23 00:00:17.664193002  2171.00        2  2170.75  2171.0
2016-08-23 00:00:17.664193003  2171.00        1  2170.75  2171.0
2016-08-23 00:00:26.206108000  2170.75        2  2170.75  2171.0
2016-08-23 00:00:28.322456000  2170.75        7  2170.75  2171.0
2016-08-23 00:00:28.322456001  2170.75        1  2170.75  2171.0
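Before transforming anything, the scale of the problem can be read straight off the index. A minimal sketch (with a small hand-built index standing in for the real data):

```python
import pandas as pd

# four rows share one timestamp, mirroring the sample above
idx = pd.DatetimeIndex(['2016-08-23 00:00:17.664193'] * 4 +
                       ['2016-08-23 00:00:26.206108'])

print(idx.is_unique)                       # False
print(idx.duplicated(keep='first').sum())  # 3 rows would need an offset
```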

I've been struggling to find a good way to do this. My current solution is to make multiple passes, checking for duplicates on each pass and adding 1ns to every timestamp in a run of identical ones except the first. Here's the code:

import numpy as np
import pandas as pd

def make_timestamps_unique(ts):
    mask = ts.index.duplicated(keep='first')
    duplicate_count = np.sum(mask)
    passes = 0

    while duplicate_count > 0:
        ts.loc[:, 'newindex'] = ts.index
        ts.loc[mask, 'newindex'] += pd.Timedelta('1ns')
        ts = ts.set_index('newindex')
        mask = ts.index.duplicated(keep='first')
        duplicate_count = np.sum(mask)
        passes += 1

    print('%d passes of duplication loop' % passes)
    return ts

This is obviously very inefficient - it typically requires hundreds of passes, and if I try it on a 2-million-row dataframe I get a MemoryError. Any ideas for a better way to achieve this?

3 Answers:

Answer 0 (score: 6)

Here is a faster numpy version (but a little less readable), inspired by this SO article. The idea is to take a cumulative sum over the duplicated-timestamp flags while resetting the running sum every time a np.NaN is encountered:

# get duplicated values as float and replace 0 with NaN
values = df.index.duplicated(keep=False).astype(float)
values[values==0] = np.NaN

missings = np.isnan(values)
cumsum = np.cumsum(~missings)
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff

# print result
result = df.index + np.cumsum(values).astype(np.timedelta64)
print(result)

DatetimeIndex([   '2016-08-23 00:00:14.161128',
                  '2016-08-23 00:00:14.901180',
                  '2016-08-23 00:00:17.196639',
               '2016-08-23 00:00:17.664193001',
               '2016-08-23 00:00:17.664193002',
               '2016-08-23 00:00:17.664193003',
               '2016-08-23 00:00:17.664193004',
                  '2016-08-23 00:00:26.206108',
               '2016-08-23 00:00:28.322456001',
               '2016-08-23 00:00:28.322456002'],
              dtype='datetime64[ns]', name='datetime', freq=None)

Timing this solution yields 10000 loops, best of 3: 107 µs per loop, whereas the groupby/apply approach from @DYZ (which is more readable) takes 100 loops, best of 3: 5.3 ms per loop on the same dummy data - roughly 50x slower.

Of course, you have to reset the index at the end:

df.index = result
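The cumsum-reset trick is easier to see in isolation on a toy flag array (hand-built flags standing in for `df.index.duplicated(keep=False)`):

```python
import numpy as np

# flags: 1.0 where the timestamp is duplicated (keep=False), else 0
values = np.array([0., 0., 0., 1., 1., 1., 1., 0., 1., 1.])
values[values == 0] = np.nan

missings = np.isnan(values)
cumsum = np.cumsum(~missings)                        # running duplicate count
diff = np.diff(np.concatenate(([0.], cumsum[missings])))
values[missings] = -diff                             # negative jumps reset the sum

print(np.cumsum(values))
# [0. 0. 0. 1. 2. 3. 4. 0. 1. 2.]  <- per-row ns offsets, restarting at each run
```

Note that with keep=False every member of a run gets an offset, including the first - which is why the DatetimeIndex output above starts at ...664193001 rather than leaving the first duplicate unchanged, unlike the transformation the question asked for.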

Answer 1 (score: 5)

You could group the rows by index and then add a sequence of consecutive timedeltas to the index of each group. I'm not sure whether this can be done directly on the index, but you can first convert the index to an ordinary column, apply the operation to the column, and then set the column as the index again:

newindex = ts.reset_index()\
             .groupby('datetime')['datetime']\
             .apply(lambda x: x + np.arange(x.size).astype(np.timedelta64))
ts.index = newindex
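The same grouping idea can also be written without the Python-level lambda by using cumcount, which numbers each row within its duplicate group. A sketch on a small stand-in frame (not the OP's data):

```python
import pandas as pd

idx = pd.DatetimeIndex(['2016-08-23 00:00:17.664193'] * 3 +
                       ['2016-08-23 00:00:26.206108'], name='datetime')
ts = pd.DataFrame({'last': [2171.0, 2171.0, 2171.0, 2170.75]}, index=idx)

# 0, 1, 2, ... within each run of identical timestamps
offsets = ts.groupby(level=0).cumcount().to_numpy()
ts.index = ts.index + pd.to_timedelta(offsets, unit='ns')

print(ts.index.is_unique)  # True
```

Like the question's desired output, this leaves the first timestamp of each run unchanged and offsets the rest by 1ns, 2ns, and so on.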

Answer 2 (score: 1)

Since you're dealing with 1M+ rows, a vectorized approach should be a priority here.

Let's create some test data, since none was provided:

rng = pd.date_range('1/1/2011', periods=72, freq='H')

df = pd.DataFrame(dict(time = rng))

Duplicate the timestamps:

df = pd.concat((df, df))
df = df.sort_values('time')

df
Out [296]:
                  time
0  2011-01-01 00:00:00
0  2011-01-01 00:00:00
1  2011-01-01 01:00:00
1  2011-01-01 01:00:00
2  2011-01-01 02:00:00
2  2011-01-01 02:00:00
3  2011-01-01 03:00:00
3  2011-01-01 03:00:00
4  2011-01-01 04:00:00
4  2011-01-01 04:00:00
5  2011-01-01 05:00:00
5  2011-01-01 05:00:00
6  2011-01-01 06:00:00
6  2011-01-01 06:00:00
7  2011-01-01 07:00:00
7  2011-01-01 07:00:00
8  2011-01-01 08:00:00
8  2011-01-01 08:00:00
9  2011-01-01 09:00:00
9  2011-01-01 09:00:00

Find the positions where the time difference from the previous row is 0 seconds:

mask = (df.time-df.time.shift()) == np.timedelta64(0,'s')

mask
Out [307]:
0     False
0      True
1     False
1      True
2     False
2      True
3     False
3      True
4     False
4      True
5     False

Offset those positions - in this case I chose milliseconds:

df.loc[mask,'time'] = df.time[mask].apply(lambda x: x+pd.offsets.Milli(5))

Out [309]:
                      time
0  2011-01-01 00:00:00.000
0  2011-01-01 00:00:00.005
1  2011-01-01 01:00:00.000
1  2011-01-01 01:00:00.005
2  2011-01-01 02:00:00.000
2  2011-01-01 02:00:00.005
3  2011-01-01 03:00:00.000
3  2011-01-01 03:00:00.005
4  2011-01-01 04:00:00.000
4  2011-01-01 04:00:00.005
5  2011-01-01 05:00:00.000

Edit: handling runs of consecutive identical timestamps [assuming runs of up to 4]:

consect = 4
for i in range(4):
    # match rows whose timestamp equals the one `consect` rows back,
    # then push them forward by a growing millisecond offset
    mask = (df.time-df.time.shift(consect)) == np.timedelta64(0,'s')
    df.loc[mask,'time'] = df.time[mask].apply(lambda x: x+pd.offsets.Milli(5+i))
    consect -= 1
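A quick end-to-end check of this loop (using the hourly test frame from above, tripled so that each timestamp occurs three times; sort_values replaces the old sort method):

```python
import numpy as np
import pandas as pd

rng = pd.date_range('1/1/2011', periods=72, freq='h')
df = pd.DataFrame(dict(time=rng))
df = pd.concat((df, df, df)).sort_values('time').reset_index(drop=True)

consect = 4
for i in range(4):
    mask = (df.time - df.time.shift(consect)) == np.timedelta64(0, 's')
    df.loc[mask, 'time'] = df.time[mask].apply(lambda x: x + pd.offsets.Milli(5 + i))
    consect -= 1

print(df.time.is_unique)  # True
```

One caveat: with this frame each run of three comes out as +0/+8/+7 ms, so the result is unique but no longer strictly sorted by time.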