I wrote some code that works correctly, shown below. I need to optimize its runtime.
for i in range(len(df)):
    try:
        if df['event_name'][i] in ['add_basket_click','remove_basket_click'] and df['event_name'][i-1]=='product_search':
            try:
                if df['event_desc'][i]['firebase_screen_id']==df['event_desc'][i-1]['firebase_screen_id']:
                    df.at[i,'search_process']=1
            except:
                pass
    except:
        pass
Here is a sample dataset:
user_id event_name event_desc
10 product_search {'firebase_previous_id': '8996730796507124997'}
10 add_basket_click {'firebase_previous_id': '8996730796507124997'}
10 start {'firebase_previous_id': '8996730796507124997'}
10 add_basket_click {'firebase_previous_id': '8996730796507124997'}
Output:
user_id event_name event_desc search_process
10 product_search {'firebase_previous_id': '8996730796507124997'} 0
10 add_basket_click {'firebase_previous_id': '8996730796507124997'} 1
10 start {'firebase_previous_id': '8996730796507124997'} 0
10 add_basket_click {'firebase_previous_id': '8996730796507124997'} 0
Answer 0 (score: 3)
I believe you need to test firebase_previous_id in the dictionaries of the event_desc column, instead of firebase_screen_id:
# previous row is a product_search event
m1 = df['event_name'].shift() == 'product_search'
# current row is an add/remove basket click
m2 = df['event_name'].isin(['add_basket_click', 'remove_basket_click'])
# different defaults in the two .get calls, so rows missing the key never compare equal
s1 = df['event_desc'].apply(lambda x: x.get('firebase_previous_id', 'not_m'))
s2 = df['event_desc'].apply(lambda x: x.get('firebase_previous_id', 'not_matched'))
# current row's id equals the previous row's id
m3 = s1 == s2.shift()
df['search_process'] = (m1 & m2 & m3).astype(int)
print(df)
user_id event_name event_desc \
0 10 product_search {'firebase_previous_id': '8996730796507124997'}
1 10 add_basket_click {'firebase_previous_id': '8996730796507124997'}
2 10 start {'firebase_previous_id': '8996730796507124997'}
3 10 add_basket_click {'firebase_previous_id': '8996730796507124997'}
search_process
0 0
1 1
2 0
3 0
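The two .get calls in the code above use different defaults ('not_m' and 'not_matched'). Here is a minimal sketch of why that probably matters, on a hypothetical two-row frame whose dictionaries both lack the key. The frame and the 'missing' placeholder are made up for illustration and are not part of the question's data:

```python
import pandas as pd

# Hypothetical two-row frame: both event_desc dicts lack the key.
df = pd.DataFrame({
    'event_name': ['product_search', 'add_basket_click'],
    'event_desc': [{}, {}],
})

key = 'firebase_previous_id'

# Same default on both sides: two missing values compare equal.
same = df['event_desc'].apply(lambda d: d.get(key, 'missing'))
false_positive = (same == same.shift()).tolist()

# Different defaults, as in the answer: missing values never match.
s1 = df['event_desc'].apply(lambda d: d.get(key, 'not_m'))
s2 = df['event_desc'].apply(lambda d: d.get(key, 'not_matched'))
safe = (s1 == s2.shift()).tolist()

print(false_positive)  # [False, True]
print(safe)            # [False, False]
```

With a shared default, the second row would be flagged even though neither row has an id, which is the case the original loop skips through its inner try/except.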
Answer 1 (score: 2)
Try using the multiprocessing package to split the data processing across multiple Processes (ideally matching the number of cores your PC has).
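One way this suggestion could look, as a rough sketch under stated assumptions rather than a tuned recipe. The names flag_chunk, split_with_overlap, stitch and parallel_flags are hypothetical, the per-chunk logic is borrowed from the first answer (including its firebase_previous_id key), and the DataFrame index is assumed to be unique. Because each row is compared with the row before it, every chunk after the first borrows one extra leading row, which is dropped again when the results are stitched together:

```python
import multiprocessing as mp

import pandas as pd

KEY = 'firebase_previous_id'  # assumed key, as in the first answer
CLICKS = ['add_basket_click', 'remove_basket_click']


def flag_chunk(chunk):
    """Compute the 0/1 flags for one block of consecutive rows."""
    prev_is_search = chunk['event_name'].shift() == 'product_search'
    is_click = chunk['event_name'].isin(CLICKS)
    # Different defaults so rows missing the key never compare equal.
    s1 = chunk['event_desc'].apply(lambda d: d.get(KEY, 'not_m'))
    s2 = chunk['event_desc'].apply(lambda d: d.get(KEY, 'not_matched'))
    return (prev_is_search & is_click & (s1 == s2.shift())).astype(int)


def split_with_overlap(df, n_chunks):
    """Split df into row blocks; each block after the first also
    includes the row just before it, so shift() can see it."""
    size = max(1, -(-len(df) // n_chunks))  # ceiling division
    return [df.iloc[max(start - 1, 0):start + size]
            for start in range(0, len(df), size)]


def stitch(parts):
    """Drop the borrowed leading row from every part but the first."""
    return pd.concat([parts[0]] + [p.iloc[1:] for p in parts[1:]])


def parallel_flags(df, n_workers=None):
    """Flag rows using a pool of worker processes."""
    if len(df) < 2:
        return flag_chunk(df)  # nothing worth splitting
    n_workers = n_workers or mp.cpu_count()
    chunks = split_with_overlap(df, n_workers)
    with mp.Pool(n_workers) as pool:
        parts = pool.map(flag_chunk, chunks)
    return stitch(parts)
```

It would be called as df['search_process'] = parallel_flags(df), from under an if __name__ == '__main__': guard on platforms that spawn worker processes. Whether this beats the single-process vectorized version depends on the data size, since every chunk has to be pickled and sent to a worker.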