Question

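I am working with an eye-tracking dataset and, for some reason, once the values in the column df['timestamp'] exceed 1,000,000 they are rounded to a multiple of 100. This is a problem because the eye tracker stores a new data point roughly every 20 units.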

I managed to find a solution that works for me, but I wonder whether there is a more elegant, vectorized approach?

import numpy as np

# create a variable that tracks the difference in time
df['dt'] = (df['timestamp'] - df['timestamp'].shift(1))

# I want to keep the old timestamps, so I make a new column
df['new_timestamp'] = df['timestamp']

# each pass adds 20 to every row that still equals the row before it
for i in range(1,6):
    df['new_timestamp'] = np.where(df['dt'] == 0,
                                   df['new_timestamp'] + 20,
                                   df['new_timestamp'])
    # recompute the differences so the next pass only hits the remaining duplicates
    df['dt'] = (df['new_timestamp'] - df['new_timestamp'].shift(1))

Edit:

To be more precise, some values follow a pattern like this:

Current      Corrected    
5113100.0    5113100.0
5113100.0    5113120.0
5113100.0    5113140.0
5113100.0    5113160.0
5113100.0    5113180.0
5113200.0    5113200.0

Answer 1

You can use the .diff() method to get the differences (just cleaner, not faster). You can then select all rows where the difference is 0 and add 20 to them.

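# keep the original column and work on a copy
df['new_timestamp'] = df['timestamp']
# position of each row within its run of identical consecutive timestamps (0, 1, 2, ...)
occurrences = df.timestamp.groupby((df.timestamp != df.timestamp.shift()).cumsum()).cumcount()
# shift every repeated row by 20 times its position in the run
df.loc[df['timestamp'].diff() == 0, 'new_timestamp'] += 20 * occurrences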

Edit

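I edited the code to account for multiple consecutive occurrences. The trick is to count the number of consecutive zeros and add 20 times that number. The second line is tricky, but it is well explained in another post.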
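Since the second line packs three steps into one expression, it may help to see them separately. The following is a minimal sketch on a small made-up DataFrame; the intermediate names `starts` and `run_id` are introduced here for illustration and do not appear in the code above.

```python
import pandas as pd

# Toy data: several consecutive rows share the same timestamp.
df = pd.DataFrame({"timestamp": [9860, 9880, 9880, 9880, 9880, 9960, 9980]})

# Step 1: True wherever the value differs from the previous row,
# i.e. at the first row of each run of identical values.
starts = df["timestamp"] != df["timestamp"].shift()

# Step 2: the cumulative sum of those flags gives every row a run id.
run_id = starts.cumsum()

# Step 3: cumcount numbers the rows inside each run: 0, 1, 2, ...
occurrences = df["timestamp"].groupby(run_id).cumcount()

print(pd.DataFrame({"timestamp": df["timestamp"],
                    "run_id": run_id,
                    "occurrences": occurrences}))
```

Rows that start a run get 0, so only the repeated rows receive a non-zero offset once the counter is multiplied by 20.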

An example:

>>>      timestamp   occurrences   new_timestamp
443        9860          0             9860
444        9880          0             9880
445        9880          1             9900
446        9880          2             9920
447        9880          3             9940
448        9960          0             9960
449        9980          0             9980
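As a check against the pattern given in the question, the same run counter can be applied to those six sample values. This is only a sketch: it assumes a fixed step of 20 between samples and that the first row of each run holds the correct value. The constant `STEP` is a name introduced here. Because the counter is 0 on the first row of every run, the sketch adds the offset to all rows instead of selecting the rows where the difference is 0 first, which should give the same result under these assumptions.

```python
import pandas as pd

STEP = 20  # assumed sampling interval, taken from the question's description

# The "Current" column from the question's example.
df = pd.DataFrame({"timestamp": [5113100.0, 5113100.0, 5113100.0,
                                 5113100.0, 5113100.0, 5113200.0]})

# Position of each row within its run of identical consecutive timestamps.
run_id = (df["timestamp"] != df["timestamp"].shift()).cumsum()
occurrences = df["timestamp"].groupby(run_id).cumcount()

# The first row of a run has occurrences == 0 and is left unchanged.
df["new_timestamp"] = df["timestamp"] + STEP * occurrences

print(df)
```

On this input, new_timestamp matches the "Corrected" column from the question: 5113100, 5113120, 5113140, 5113160, 5113180, 5113200.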

Vectorized modification of a timestamp column in pandas with a fixed pattern
