我发现的最相似的问题是here,但没有正确的答案。
基本上,我遇到一个问题,试图在数据帧上使用groupby来生成公交路线的唯一ID。问题是,我可以随时使用的数据(尽管很少)在groupby列中具有相同的值,因此即使它们不是同一总线,它们也被视为同一条总线。
我唯一想到的另一种方法是根据称为“停止类型”的另一列对总线进行分组,其中有一个指示“开始”,“中间”和“结束”的指示器。我想使用groupby根据此列创建组,其中每个组从“停止类型” =开始的地方开始,到“停止类型” =结束的地方结束。
请考虑以下数据:
df = pd.DataFrame({'Vehicle_ID': ['A']*18,
'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3)})
Cond Position
0 A START
1 A MID
2 A MID
3 A END
4 A MID
5 A START
6 A START
7 A MID
8 A MID
9 A END
10 A MID
11 A START
12 A START
13 A MID
14 A MID
15 A END
16 A MID
17 A START
我想出的将这些总线准确地分组在一起的唯一方法是生成带有总线序列ID的附加列,但是鉴于我要处理大量数据,所以这不是一个非常有效的解决方案。我希望能够通过一个groupby来管理我想做的事情,以便生成以下输出
Cond Position Group
0 A START 1
1 A MID 1
2 A MID 1
3 A END 1
4 A MID
5 A START 2
6 A START 2
7 A MID 2
8 A MID 2
9 A END 2
10 A MID
11 A START 3
12 A START 3
13 A MID 3
14 A MID 3
15 A END 3
16 A MID
17 A START 4
答案 0 :(得分:1)
一个想法是通过np.select
进行分解,然后通过numba
使用自定义循环:
from numba import njit
df = pd.DataFrame({'Vehicle_ID': ['A']*18,
'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})
@njit
def grouper(pos):
res = np.empty(pos.shape)
num = 1
started = 0
for i in range(len(res)):
current_pos = pos[i]
if (started == 0) and (current_pos == 0):
started = 1
res[i] = num
elif (started == 1) and (current_pos == 1):
started = 0
res[i] = num
num += 1
elif (started == 1) and (current_pos in [-1, 0]):
res[i] = num
else:
res[i] = 0
return res
arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
df['Group'] = grouper(arr).astype(int)
结果:
print(df)
Position Vehicle_ID Group
0 START A 1
1 MID A 1
2 MID A 1
3 END A 1
4 MID A 0
5 START A 2
6 START A 2
7 MID A 2
8 MID A 2
9 END A 2
10 MID A 0
11 START A 3
12 START A 3
13 MID A 3
14 MID A 3
15 END A 3
16 MID A 0
17 START A 4
我认为,您应该不包括“空白”值,因为这将迫使您的系列为object
dtype,对于任何后续处理均效率低下。如上所述,您可以改用0
。
numba
比一种纯熊猫方法快约10倍:-
import pandas as pd, numpy as np
from numba import njit
df = pd.DataFrame({'Vehicle_ID': ['A']*18,
'Position': ['START', 'MID', 'MID', 'END', 'MID', 'START']*3})
df = pd.concat([df]*10, ignore_index=True)
assert joz(df.copy()).equals(jpp(df.copy()))
%timeit joz(df.copy()) # 18.6 ms per loop
%timeit jpp(df.copy()) # 1.95 ms per loop
基准功能:
def joz(df):
# identification of sequences
df['Position_Prev'] = df['Position'].shift(1)
df['Sequence'] = 0
df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
df.loc[df['Position'] == 'END', 'Sequence'] = -1
df['Sequence_Sum'] = df['Sequence'].cumsum()
df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1
# take only items between START and END and generate Group number
df2 = df[df['Sequence_Sum'] == 1].copy()
df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
df2['Group'] = df2['Sequence'].cumsum()
# merge results to one dataframe
df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
df['Group'] = df['Group'].fillna(0)
df['Group'] = df['Group'].astype(int)
df.drop(['Position_Prev', 'Sequence', 'Sequence_Sum'], axis=1, inplace=True)
return df
@njit
def grouper(pos):
res = np.empty(pos.shape)
num = 1
started = 0
for i in range(len(res)):
current_pos = pos[i]
if (started == 0) and (current_pos == 0):
started = 1
res[i] = num
elif (started == 1) and (current_pos == 1):
started = 0
res[i] = num
num += 1
elif (started == 1) and (current_pos in [-1, 0]):
res[i] = num
else:
res[i] = 0
return res
def jpp(df):
arr = np.select([df['Position'].eq('START'), df['Position'].eq('END')], [0, 1], -1)
df['Group'] = grouper(arr).astype(int)
return df
答案 1 :(得分:1)
我有一些解决方案。您必须避免循环,并尝试使用滑动,切片和合并。
这是我的第一个原型(应该重构)
# identification of sequences
df['Position_Prev'] = df['Position'].shift(1)
df['Sequence'] = 0
df.loc[(df['Position'] == 'START') & (df['Position_Prev'] != 'START'), 'Sequence'] = 1
df.loc[df['Position'] == 'END', 'Sequence'] = -1
df['Sequence_Sum'] = df['Sequence'].cumsum()
df.loc[df['Sequence'] == -1, 'Sequence_Sum'] = 1
# take only items between START and END and generate Group number
df2 = df[df['Sequence_Sum'] == 1].copy()
df2.loc[df['Sequence'] == -1, 'Sequence'] = 0
df2['Group'] = df2['Sequence'].cumsum()
# merge results to one dataframe
df = df.merge(df2[['Group']], left_index=True, right_index=True, how='left')
df['Group'] = df['Group'].fillna(0)
df['Group'] = df['Group'].astype(int)
df.drop(columns=['Position_Prev', 'Sequence', 'Sequence_Sum'], inplace=True)
df
结果:
Vehicle_ID Position Group
0 A START 1
1 A MID 1
2 A MID 1
3 A END 1
4 A MID 0
5 A START 2
6 A START 2
7 A MID 2
8 A MID 2
9 A END 2
10 A MID 0
11 A START 3
12 A START 3
13 A MID 3
14 A MID 3
15 A END 3
16 A MID 0
17 A START 4