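I have a list of data in Python, shown in the table below.
It is generated by observing what our robot does in our maze/arena. Each event has a timestamp; for now the timestamps are event-driven rather than periodic.
I need an efficient way to find the time spent in each arena.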
TimeStamp Arena
101 Arena A
109 Arena A
112 Arena B
113 Arena A
118 Arena A
120 Arena D
125 Arena D
129 Arena D
138 Arena B
139 Arena B
148 Arena C
149 Arena C
150 Arena B
151 Arena B
159 Arena D
169 Arena D
171 Arena D
172 Arena D
175 Arena B
177 Arena B
180 Arena B
181 Arena A
182 Arena A
189 Arena E
200 Arena E
204 Arena E
208 Arena A
209 Arena A
This is the result I need: the total time spent in each arena.
Arena TimeStamp
Arena D 32
Arena B 23
Arena E 22
Arena A 16
Arena C 10
I wrote a simple script that currently does this.
import pandas as pd

data = pd.read_csv('arenas_visited.csv')
l = len(data)
first_arena = data.loc[0, 'Arena']
start_time = data.loc[0, 'TimeStamp']
summary = []
for i in range(0, l):
    try:
        next_arena = data.loc[i+1, 'Arena']
    except KeyError:
        # No row i+1: we have reached the last row
        break
    first_arena = data.loc[i, 'Arena']
    if first_arena != next_arena:
        # The arena changes after row i, so close the current run here
        change_time = data.loc[i, 'TimeStamp']
        time_spent = change_time - start_time
        arena = str(data.loc[i, 'Arena'])
        summary.append([arena, time_spent])
        start_time = change_time
        first_arena = data.loc[i+1, 'Arena']
    if i == l-2:
        # Special case: the very last row is in a different arena
        if data.loc[i, 'Arena'] != data.loc[i+1, 'Arena']:
            time_spent = 1
            arena = str(data.loc[i+1, 'Arena'])
            print(str(1) + " Spent in " + arena)
            summary.append([arena, time_spent])
        else:
            pass
aggregated = pd.DataFrame(summary, columns=['Arena', 'TimeStamp'])
time_per_arena = aggregated.groupby(['Arena']).sum().sort_values('TimeStamp', ascending=False).reset_index()
print(time_per_arena)
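This works fine. However, I will eventually have millions of rows of this data, so I need a faster approach.
But I don't see any way to do this other than iterating over every row.
Is there something I'm not considering?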
Answer 0 (score: 2)
Create a vector of time deltas, then group by arena and sum it:
df['delta'] = df.TimeStamp - df.TimeStamp.shift()
df.groupby('Arena').delta.sum()
Out[62]:
Arena
Arena_A 21.0
Arena_B 23.0
Arena_C 10.0
Arena_D 32.0
Arena_E 22.0
Name: delta, dtype: float64
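To make the vectorized idea concrete, here is a minimal self-contained sketch. It assumes the sample rows from the question are loaded into a DataFrame with the arena names shortened to single letters; the variable names (`timestamps`, `arenas`, `by_arrival`, `by_departure`) are illustrative and do not come from the original answer. It also shows that which row a gap is credited to is a choice: a backward difference credits each gap to the row that ends it, while a forward difference credits it to the row that starts it.

```python
import pandas as pd

# Sample data copied from the question (arena names shortened to one letter).
timestamps = [101, 109, 112, 113, 118, 120, 125, 129, 138, 139, 148, 149, 150, 151,
              159, 169, 171, 172, 175, 177, 180, 181, 182, 189, 200, 204, 208, 209]
arenas = list("AABAADDDBBCCBBDDDDBBBAAEEEAA")
df = pd.DataFrame({"TimeStamp": timestamps, "Arena": arenas})

# Backward difference: each gap is credited to the arena of the row that ends it.
# The first row has no predecessor, so its delta is NaN and is skipped by sum().
df["delta"] = df["TimeStamp"].diff()
by_arrival = df.groupby("Arena")["delta"].sum().sort_values(ascending=False)

# Forward difference: each gap is credited to the arena of the row that starts it.
# The last row has no successor, so its delta is NaN and is skipped by sum().
df["delta_next"] = df["TimeStamp"].shift(-1) - df["TimeStamp"]
by_departure = df.groupby("Arena")["delta_next"].sum().sort_values(ascending=False)

print(by_arrival)
print(by_departure)
```

With the backward difference the totals match the output shown above (A 21, B 23, C 10, D 32, E 22). The forward difference shifts time toward the arena being left, so the totals differ even though both sum to the same overall span of 108.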
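Answer 1 (score: 0)
Python has a lot of nice built-ins that other languages don't give you out of the box. You can index the data yourself as you go, like this: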
result = {}
old_arena = None
old_timestamp = 0
# I don't have a lot of experience with pandas, so you may need to massage the
# input to be able to do this; data is assumed to be lines like "101 Arena A"
for line in data:
    timestamp, _, arena = line.split()
    if arena == old_arena:
        continue
    timestamp = int(timestamp)
    # Credit the elapsed interval to the arena we are leaving
    result[old_arena] = result.get(old_arena, 0) + timestamp - old_timestamp
    old_arena = arena
    old_timestamp = timestamp
# Process the last interval - if the arena changed on the last line, then
# old_timestamp equals timestamp and this adds zero
result[old_arena] = result.get(old_arena, 0) + int(timestamp) - old_timestamp
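This processes the whole list in a single pass, using O(n) time and O(n+k) space, where k is the number of arenas.
For your example data, result contains the following dict (the None key holds the initial time offset):
{'A': 27, 'C': 2, 'B': 26, 'E': 19, 'D': 34, None: 101}
Note that this credits each transition interval to old_arena, the arena being left, which may not be what you want.
If you would rather credit it to the arena being entered, a minor edit that reverses the traversal does that: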
result = {}
old_arena = None
old_timestamp = 0
# I don't have a lot of experience with pandas, so you may need to massage the
# input to be able to do this; data is assumed to be lines like "101 Arena A"
for line in reversed(data):
    timestamp, _, arena = line.split()
    if arena == old_arena:
        continue
    timestamp = int(timestamp)
    # Walking backwards, old_arena is the arena that comes later in time
    result[old_arena] = result.get(old_arena, 0) + old_timestamp - timestamp
    old_arena = arena
    old_timestamp = timestamp
# Process the last interval (the earliest run in time) - if the arena changed
# on the final line, old_timestamp equals timestamp and this adds zero
result[old_arena] = result.get(old_arena, 0) + old_timestamp - int(timestamp)
This gives:
{'A': 21, 'C': 10, 'B': 23, 'E': 22, 'D': 32, None: -209}
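The same totals can be reached without the None bookkeeping entry and without reversing the input. The following is a minimal sketch, not the answerer's code: it assumes the events are already parsed into (timestamp, arena) pairs sorted by timestamp, and the names `time_per_arena` and `sample_events` are hypothetical. Each gap between consecutive events is credited to the arena of the later event.

```python
from collections import defaultdict


def time_per_arena(events):
    """Sum the gaps between consecutive events, crediting each gap to the
    arena of the later event. `events` is an iterable of (timestamp, arena)
    pairs sorted by timestamp."""
    totals = defaultdict(int)
    it = iter(events)
    try:
        prev_ts, _ = next(it)  # the first event only provides the start time
    except StopIteration:
        return {}  # no events at all
    for ts, arena in it:
        totals[arena] += ts - prev_ts
        prev_ts = ts
    return dict(totals)


# Sample data copied from the question (arena names shortened to one letter).
sample_events = list(zip(
    [101, 109, 112, 113, 118, 120, 125, 129, 138, 139, 148, 149, 150, 151,
     159, 169, 171, 172, 175, 177, 180, 181, 182, 189, 200, 204, 208, 209],
    "AABAADDDBBCCBBDDDDBBBAAEEEAA",
))
result = time_per_arena(sample_events)
print(sorted(result.items(), key=lambda kv: kv[1], reverse=True))
```

This stays a single O(n) pass and only needs an iterator, so it would also work on rows streamed from a file instead of a list held in memory.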