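I have a dataset in which I want to assign a unique ID to each stretch of motion, starting a new ID every time the speed reaches zero and then becomes positive again.
The code I came up with seems very slow, and I suspect there must be a faster way.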
import time
import pandas as pd
import numpy as np
#--------------------------------
# DEBUG TEST DATASET
#--------------------------------
#Create random test data
series_random = np.random.randint(low=1, high=10, size=(10000,1))
#Insert zeros at known points (this should result in six motion IDs)
series_random[[5,6,7,15,100,2000,5000]] = 0
#Create data frame from test series
df = pd.DataFrame(series_random, columns=['Speed'])
#--------------------------------
#Elapsed time counter
Elapsed_ms = time.time()
#Set Motion ID variable
Motion_ID = 0
#Create series with Motion IDs
df.loc[:,'Motion ID'] = 0
#Iterate through each row of df
for i in range(df.index.min()+1, df.index.max()+1):
    #Set Motion ID to latest value
    df.loc[i, 'Motion ID'] = Motion_ID
    #If previous speed was zero and current speed is >0, then new motion detected
    if df.loc[i-1, 'Speed'] == 0 and df.loc[i, 'Speed'] > 0:
        Motion_ID += 1
        df.loc[i, 'Motion ID'] = Motion_ID
        #Include first zero value in new Motion ID (for plotting purposes)
        df.loc[i-1, 'Motion ID'] = Motion_ID
Elapsed_ms = int((time.time() - Elapsed_ms) * 1000)
print('Result: {} records checked, {} unique trips identified in {} ms'.format(len(df.index),df['Motion ID'].nunique(),Elapsed_ms))
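The output of the code above is:
Result: 10000 records checked, 6 unique trips identified in 6879 ms
My actual dataset will be much larger, so I am surprised that even this small example takes so long for what seems like a simple operation.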
Answer 0 (score: 0)
You can express the logic with boolean arrays and expressions in numpy, without any loops:
def get_motion_id(speed):
    # Work on a plain NumPy array so the slicing below is purely positional
    speed = np.asarray(speed)
    mask = np.zeros(speed.size, dtype=bool)
    # mask[i] == True if Speed[i - 1] == 0 and Speed[i] > 0
    mask[1:] = speed[:-1] == 0
    mask &= speed > 0
    # Taking the cumsum increases the motion_id by one where mask is True
    motion_id = mask.astype(int).cumsum()
    # Carry over beginning of a motion to the preceding step with Speed == 0
    motion_id[:-1] = motion_id[1:]
    return motion_id
# small demo example
df = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df['Motion_ID'] = get_motion_id(df['Speed'])
print(df)
   Speed  Motion_ID
0      3          0
1      0          1
2      1          1
3      2          1
4      0          2
5      1          2
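To see why this works, it can help to print the intermediate arrays for the demo input. The sketch below repeats the steps of get_motion_id outside the function; the variable names prev_zero, starts and shifted are illustrative and do not appear in the code above.

```python
import numpy as np

speed = np.array([3, 0, 1, 2, 0, 1])

prev_zero = np.zeros(speed.size, dtype=bool)
prev_zero[1:] = speed[:-1] == 0    # True where the previous sample was zero
starts = prev_zero & (speed > 0)   # ...and the current sample is moving
ids = starts.cumsum()              # running count of motion starts
shifted = ids.copy()
shifted[:-1] = ids[1:]             # pull each start back onto its preceding zero

print(starts.astype(int))  # [0 0 1 0 0 1]
print(ids)                 # [0 0 1 1 1 2]
print(shifted)             # [0 1 1 1 2 2]
```

Because the ID only changes at a start, shifting the whole array back by one row changes exactly the rows that sit just before a start, which are the zeros that should join the new motion.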
For your 10,000-row example, I see a speedup of roughly 800x:
%time df['Motion_ID'] = get_motion_id(df['Speed'])
CPU times: user 5.26 ms, sys: 3.18 ms, total: 8.43 ms
Wall time: 8.01 ms
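If you would rather stay within pandas, the same logic can probably be written with shift and cumsum. This is only a sketch: the function name motion_id_pandas is made up for illustration, and I have not timed it against the numpy version.

```python
import pandas as pd

def motion_id_pandas(speed: pd.Series) -> pd.Series:
    # A motion starts where the previous speed was 0 and the current one is > 0
    starts = speed.shift(1).eq(0) & speed.gt(0)
    # Running count of starts gives the ID of every row
    ids = starts.cumsum()
    # Pull each ID back one row so the zero before a start joins the new motion;
    # the last row has no successor, so it keeps its own ID
    return ids.shift(-1).fillna(ids).astype(int)

demo = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
demo['Motion_ID'] = motion_id_pandas(demo['Speed'])
print(demo['Motion_ID'].tolist())  # [0, 1, 1, 1, 2, 2]
```

The result is a Series aligned with the input index, so it can be assigned straight back to a column.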
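Answer 1 (score: 0)
Another way to do this is to extract the index values where Speed is 0 from df, then iterate over only those index values to check and assign the Motion ID values. See the code below: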
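Motion_ID = 0
#Create series with Motion IDs
df.loc[:,'Motion ID'] = 0
i = 0
#Iterate over the index labels where Speed is zero
for index_val in sorted(df[df['Speed'] == 0].index):
    #Rows from the previous zero up to this one belong to the current motion
    df.loc[i:index_val,'Motion ID'] = Motion_ID
    i = index_val
    #A zero followed by a positive speed starts a new motion
    #(the bounds check avoids a KeyError when the last row is a zero)
    if index_val+1 <= df.index.max() and df.loc[index_val+1, 'Speed'] > 0:
        Motion_ID += 1
#Remaining rows after the last zero belong to the latest motion
#(Motion_ID, not Motion_ID+1, so the IDs stay consecutive)
df.loc[i:df.index.max(),'Motion ID'] = Motion_ID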
Output:
Result: 10000 records checked, 6 unique trips identified in 49 ms
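The remaining loop over the zero positions could likely be removed as well. The sketch below is one possible way, using np.flatnonzero to find the zeros that are followed by a positive speed and np.searchsorted to count how many of them lie at or before each row. The function name motion_id_from_zero_positions is hypothetical, and the code assumes the rows are in their original order.

```python
import numpy as np

def motion_id_from_zero_positions(speed):
    speed = np.asarray(speed)
    # Positions of zeros that are immediately followed by a positive speed
    boundary = np.flatnonzero((speed[:-1] == 0) & (speed[1:] > 0))
    # For each row, count the boundaries located at or before that row
    return np.searchsorted(boundary, np.arange(speed.size), side='right')

print(motion_id_from_zero_positions([3, 0, 1, 2, 0, 1]).tolist())  # [0, 1, 1, 1, 2, 2]
```

Each boundary zero is counted for its own row, so it is assigned to the motion that follows it, which matches the behavior of the loop in the question.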