I have a data frame that looks like this:
user_id event_name event_params
10 start /pseudo
10 subcategory /home
10 add_basket_click /click
10 add_basket_error /event
10 end /end
11 start /pseudo
11 add_basket_click /click
11 add_basket_error /event
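I want to swap the rows whose event_name is add_basket_click and add_basket_error. Currently add_basket_error comes after add_basket_click, and I want it to come before it. The output should look like the following. The actual dataset has 10 million rows, so I am looking for a Python answer.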
user_id event_name event_params
10 start /pseudo
10 subcategory /home
10 add_basket_error /event
10 add_basket_click /click
10 end /end
11 start /pseudo
11 add_basket_error /event
11 add_basket_click /click
Answer 0 (score: 3)
Here you can create a list of all possible values of event_name in the desired order, convert the column to an ordered categorical, and then sort by the 2 columns with DataFrame.sort_values:
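# Every possible event_name, listed in the order it should appear
L = ['start','subcategory','add_basket_error','add_basket_click','end']
# Values missing from L would become NaN, so the list must be complete
df['event_name'] = pd.Categorical(df['event_name'], ordered=True, categories=L)
df = df.sort_values(['user_id','event_name'])
print(df)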
user_id event_name event_params
0 10 start /pseudo
1 10 subcategory /home
3 10 add_basket_error /event
2 10 add_basket_click /click
4 10 end /end
5 11 start /pseudo
7 11 add_basket_error /event
6 11 add_basket_click /click
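For anyone who wants to try the categorical sort end to end, here is a minimal self-contained sketch. It rebuilds the sample frame from the question; the names `order` and `result` are just illustrative choices. Note that it reorders every event within a user by the category order, which only matches a pure swap when the other events are already in that order.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "user_id": [10, 10, 10, 10, 10, 11, 11, 11],
    "event_name": ["start", "subcategory", "add_basket_click",
                   "add_basket_error", "end", "start",
                   "add_basket_click", "add_basket_error"],
    "event_params": ["/pseudo", "/home", "/click", "/event", "/end",
                     "/pseudo", "/click", "/event"],
})

# Desired order of events; add_basket_error is placed before add_basket_click
order = ["start", "subcategory", "add_basket_error", "add_basket_click", "end"]

# An ordered categorical sorts by position in `order`, not alphabetically
df["event_name"] = pd.Categorical(df["event_name"], categories=order, ordered=True)

# Sort by user first, then by the categorical order within each user
result = df.sort_values(["user_id", "event_name"])
print(result)
```

The original index labels are kept after sorting, which makes it easy to see which rows moved.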
Edit:
# Sample data with a separate add_basket_error row added at index 1; it is not changed in the output
print (df)
user_id event_name event_params
0 10 start /pseudo
1 10 add_basket_error /event
2 10 subcategory /home
3 10 add_basket_click /click
4 10 add_basket_error /event
5 10 end /end
6 11 start /pseudo
7 11 add_basket_click /click
8 11 add_basket_error /event
You can compare values with Series.eq and Series.shift, and finally assign the swapped rows back:
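# Click rows that are immediately followed by an error row
m11 = df['event_name'].eq('add_basket_click')
m12 = df['event_name'].shift(-1).eq('add_basket_error')
# Error rows that are immediately preceded by a click row
m21 = df['event_name'].eq('add_basket_error')
m22 = df['event_name'].shift().eq('add_basket_click')
# Write each group of rows into the positions of the other group
df[m21 & m22], df[m11 & m12] = df[m11 & m12].values, df[m21 & m22].values
print(df)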
user_id event_name event_params
0 10 start /pseudo
1 10 add_basket_error /event
2 10 subcategory /home
3 10 add_basket_error /event
4 10 add_basket_click /click
5 10 end /end
6 11 start /pseudo
7 11 add_basket_error /event
8 11 add_basket_click /click
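As an alternative sketch of the same idea (not the code from this answer), the swap can be expressed as a permutation of row positions, so no row values have to be copied through a mixed-dtype array. The function name `swap_click_error_pairs` is hypothetical, and the extra check that both rows belong to the same user_id is an assumption added here; the masks above do not include it.

```python
import numpy as np
import pandas as pd

def swap_click_error_pairs(df, first="add_basket_click", second="add_basket_error"):
    """Swap each `first` row with a directly following `second` row of the same user."""
    names = df["event_name"]
    users = df["user_id"]
    # True on a `first` row whose next row is `second` for the same user
    is_pair_start = (
        names.eq(first)
        & names.shift(-1).eq(second)
        & users.eq(users.shift(-1))
    ).to_numpy()
    starts = np.flatnonzero(is_pair_start)
    # Start from the identity permutation and exchange each pair of positions
    order = np.arange(len(df))
    order[starts] = starts + 1
    order[starts + 1] = starts
    return df.iloc[order].reset_index(drop=True)

# Sample data from the edit above, including the unpaired error row at index 1
df = pd.DataFrame({
    "user_id": [10, 10, 10, 10, 10, 10, 11, 11, 11],
    "event_name": ["start", "add_basket_error", "subcategory",
                   "add_basket_click", "add_basket_error", "end",
                   "start", "add_basket_click", "add_basket_error"],
    "event_params": ["/pseudo", "/event", "/home", "/click", "/event",
                     "/end", "/pseudo", "/click", "/event"],
})
result = swap_click_error_pairs(df)
print(result)
```

Because a row cannot be both a click and an error, the pairs never overlap, so the two position assignments cannot conflict.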
Answer 1 (score: 3)
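Here is one potential solution, using boolean indexing and loc. It relies on a default integer index (0, 1, 2, ...) and on every add_basket_error row directly following the row it should be swapped with: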
# Boolean series of event_name containing 'add_basket_error'
s = df.event_name.str.contains('add_basket_error')
# Create 2 frames, errors and events from boolean index 's'
errors, events = (df.loc[s[s].index].copy(), df.loc[s[s].index - 1].copy())
# Swap event and error values
df.loc[s[s].index] = events.values
df.loc[s[s].index - 1] = errors.values
print(df)
[output]
user_id event_name event_params
0 10 start /pseudo
1 10 subcategory /home
2 10 add_basket_error /event
3 10 add_basket_click /click
4 10 end /end
5 11 start /pseudo
6 11 add_basket_error /event
7 11 add_basket_click /click
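Since the `index - 1` arithmetic only works under those conditions, it may be worth checking them before swapping. The helper below is a hypothetical sketch (the name `can_swap_by_index_offset` is not part of pandas or of this answer); the same-user condition is an assumption added here.

```python
import pandas as pd

def can_swap_by_index_offset(df):
    """Check that the index is 0..n-1 and that every add_basket_error row
    directly follows an add_basket_click row of the same user."""
    default_index = df.index.equals(pd.RangeIndex(len(df)))
    is_error = df["event_name"].eq("add_basket_error")
    prev_is_click = df["event_name"].shift().eq("add_basket_click")
    same_user = df["user_id"].eq(df["user_id"].shift())
    return bool(default_index and (prev_is_click & same_user)[is_error].all())

# Data from the question: every error follows a click
ok_df = pd.DataFrame({
    "user_id": [10, 10, 10, 10, 10, 11, 11, 11],
    "event_name": ["start", "subcategory", "add_basket_click",
                   "add_basket_error", "end", "start",
                   "add_basket_click", "add_basket_error"],
})

# Data with an error row that follows "start" instead of a click
bad_df = pd.DataFrame({
    "user_id": [10, 10, 10],
    "event_name": ["start", "add_basket_error", "add_basket_click"],
})

print(can_swap_by_index_offset(ok_df), can_swap_by_index_offset(bad_df))
```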
Answer 2 (score: 0)
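I did the following:
import pandas as pd
# Number the rows within each user, starting at 1
df['scounter'] = df.groupby('user_id').cumcount() + 1
# Split off the error rows (copy so the later assignment does not warn)
df1 = df[df.event_name == 'add_basket_error'].copy()
df = df[df.event_name != 'add_basket_error']
# Move each error row to just before the row that preceded it
df1['scounter'] = df1['scounter'] - 1.1
# DataFrame.append was removed in pandas 2.0; pd.concat replaces it
df = pd.concat([df, df1], ignore_index=True)
# Sort by user and adjusted counter, then rebuild the index
df = df.sort_values(['user_id', 'scounter']).reset_index(drop=True)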