Question

I have a dataset, and I want to assign a new unique value to it each time it reaches zero.

The code I came up with seems slow, and I suspect there must be a faster way.

import time
import pandas as pd
import numpy as np

#--------------------------------
#     DEBUG TEST DATASET
#--------------------------------
#Create random test data
series_random = np.random.randint(low=1, high=10, size=(10000,1))

#Insert zeros at known points (this should result in six motion IDs)
series_random[[5,6,7,15,100,2000,5000]] = 0

#Create data frame from test series
df = pd.DataFrame(series_random, columns=['Speed'])
#--------------------------------

#Elapsed time counter
Elapsed_ms = time.time()

#Set Motion ID variable
Motion_ID = 0

#Create series with Motion IDs
df.loc[:,'Motion ID'] = 0

#Iterate through each row of df
for i in range(df.index.min()+1, df.index.max()+1):

    #Set Motion ID to latest value
    df.loc[i, 'Motion ID'] = Motion_ID

    #If previous speed was zero and current speed is >0, then new motion detected        
    if df.loc[i-1, 'Speed'] == 0 and df.loc[i, 'Speed'] > 0:
        Motion_ID += 1
        df.loc[i, 'Motion ID'] = Motion_ID

        #Include first zero value in new Motion ID (for plotting purposes)
        df.loc[i-1, 'Motion ID'] = Motion_ID

Elapsed_ms = int((time.time() - Elapsed_ms) * 1000)

print('Result: {} records checked, {} unique trips identified in {} ms'.format(len(df.index),df['Motion ID'].nunique(),Elapsed_ms))

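The output of the above code is:

Result: 10000 records checked, 6 unique trips identified in 6879 ms

My real dataset will be much larger, so I am surprised that even this small example takes so long for what seems like a simple operation.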

Answer 1

You can express the logic with boolean arrays and expressions in numpy, without any loops:

def get_motion_id(speed):
    # Work on a plain NumPy array so a pandas Series can be passed in safely
    speed = np.asarray(speed)
    mask = np.zeros(speed.size, dtype=bool)

    # mask[i] == True if Speed[i - 1] == 0 and Speed[i] > 0
    mask[1:] = speed[:-1] == 0
    mask &= speed > 0

    # Taking the cumsum increases the motion_id by one where mask is True
    motion_id = mask.astype(int).cumsum()
    # Carry over beginning of a motion to the preceding step with Speed == 0
    motion_id[:-1] = motion_id[1:]
    return motion_id


# small demo example
df = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df['Motion_ID'] = get_motion_id(df['Speed'])
print(df)
   Speed  Motion_ID
0      3          0
1      0          1
2      1          1
3      2          1
4      0          2
5      1          2
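
To make the steps inside get_motion_id easier to follow, here is a minimal sketch that rebuilds the same result for the demo data one intermediate array at a time. The names prev_zero, starts and counts are illustrative and do not appear in the answer's function.

```python
import numpy as np

speed = np.array([3, 0, 1, 2, 0, 1])

# prev_zero[i] is True when the previous sample was zero
prev_zero = np.zeros(speed.size, dtype=bool)
prev_zero[1:] = speed[:-1] == 0

# A motion starts where the previous sample was zero and the current one is positive
starts = prev_zero & (speed > 0)

# Running count of the motion starts seen so far
counts = starts.cumsum()

# Shift left by one so the zero just before each start joins the new motion;
# the last element keeps its own count
motion_id = counts.copy()
motion_id[:-1] = counts[1:]

print(prev_zero.astype(int))  # [0 0 1 0 0 1]
print(starts.astype(int))     # [0 0 1 0 0 1]
print(counts)                 # [0 0 1 1 1 2]
print(motion_id)              # [0 1 1 1 2 2]
```

The final array matches the Motion_ID column printed above.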

For your 10,000-row example, I see a speedup of roughly 800x:

%time df['Motion_ID'] = get_motion_id(df['Speed'])
CPU times: user 5.26 ms, sys: 3.18 ms, total: 8.43 ms
Wall time: 8.01 ms
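
If you would rather stay in pandas, the same idea can probably be written with shift and cumsum. This is only a sketch of an alternative, not the function timed above; get_motion_id_pandas is a hypothetical name, and it assumes Speed has no missing values.

```python
import pandas as pd

def get_motion_id_pandas(speed: pd.Series) -> pd.Series:
    # True where the previous row was zero and the current row is positive
    starts = speed.shift(1).eq(0) & speed.gt(0)
    # Running count of motion starts
    counts = starts.cumsum()
    # Pull each count one row earlier so the zero before a start joins the
    # new motion; the last row has no successor and keeps its own count
    return counts.shift(-1).fillna(counts).astype(int)

df = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df['Motion_ID'] = get_motion_id_pandas(df['Speed'])
print(df['Motion_ID'].tolist())  # [0, 1, 1, 1, 2, 2]
```

I have not timed this variant, so treat any performance expectation as unverified.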

Answer 2

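Another way to do this is to extract the index values where Speed is 0 from df, then iterate over only those index values to check and assign the Motion ID. See the code below: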

Motion_ID = 0

#Create series with Motion IDs
df.loc[:,'Motion ID'] = 0
i=0
for index_val in sorted(df[df['Speed'] == 0].index):
    df.loc[i:index_val,'Motion ID'] = Motion_ID
    i = index_val
    if df.loc[index_val+1, 'Speed'] > 0:
        Motion_ID += 1

#Rows from the last zero onward belong to the current Motion ID
#(using Motion_ID+1 here would skip an ID number)
df.loc[i:df.index.max(),'Motion ID'] = Motion_ID

Output:

Result: 10000 records checked, 6 unique trips identified in 49 ms
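
Starting from the positions of the zeros also lends itself to a loop-free form. The following is a rough sketch, not part of the code above: get_motion_id_from_zeros is a hypothetical helper, and it assumes the default integer index so that positions and labels coincide.

```python
import numpy as np
import pandas as pd

def get_motion_id_from_zeros(speed):
    speed = np.asarray(speed)
    # Positions of zeros, excluding the final row (it has no following row)
    zeros = np.flatnonzero(speed[:-1] == 0)
    # Keep only the zeros that are immediately followed by a positive speed
    starts = zeros[speed[zeros + 1] > 0]
    # Motion ID of a row = number of motion starts at or before that row
    return np.searchsorted(starts, np.arange(speed.size), side='right')

df = pd.DataFrame({'Speed': [3, 0, 1, 2, 0, 1]})
df['Motion ID'] = get_motion_id_from_zeros(df['Speed'])
print(df['Motion ID'].tolist())  # [0, 1, 1, 1, 2, 2]
```

Because the starts are already sorted, np.searchsorted with side='right' counts how many of them fall at or before each row, which plays the role of the running Motion_ID counter in the loop.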

Can indexing the previous or next row avoid a DataFrame row loop?
