I am working on a problem that involves two dataframes, df1 and df_main.
df_main looks like this:
import numpy as np
import pandas as pd

users = ['id1','id1','id2','id2','id3','id3','id4']
keywords = ['k1','k1', 'k2','k2','k2','k3','k3']
quantity = [10,10,2,2,2,4,4]
duration = [1,1,3,3,3,2,2]
df_main = pd.DataFrame(list(zip(users, keywords, quantity, duration)), columns = ['users','keywords','quantity','duration'])
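df_main is basically a dataframe that holds the user_id information, the corresponding keywords, and the quantity and duration columns.
df1 has one column for the user_id, and its remaining columns are all the keywords that occur in df_main. Using df_main as the reference, every user_id and keyword pair that exists is marked 1; everything else stays 0.
Here is the code for df1: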
columns = ['USER_ID','k1','k2','k3']
users = ['id1','id2','id3','id4']
values1 = [1,0,0,0]
values2 = [0,1,1,0]
values3 = [0,0,1,1]
df1 = pd.DataFrame(list(zip(users, values1, values2, values3)), columns = columns)
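The 0/1 indicator layout described above can also be derived from df_main rather than typed by hand. The following is only a sketch under the assumption that a pair counts as "present" whenever it appears at least once in df_main; the names `indicator` and `df1_derived` are made up for illustration and are not part of the original question.

```python
import pandas as pd

users = ['id1', 'id1', 'id2', 'id2', 'id3', 'id3', 'id4']
keywords = ['k1', 'k1', 'k2', 'k2', 'k2', 'k3', 'k3']
df_main = pd.DataFrame({'users': users, 'keywords': keywords})

# crosstab counts how often each (user, keyword) pair occurs;
# clip(upper=1) turns those counts into 0/1 flags.
indicator = pd.crosstab(df_main['users'], df_main['keywords']).clip(upper=1)

# Drop the axis names and expose the users as a regular USER_ID column.
indicator.index.name = 'USER_ID'
indicator.columns.name = None
df1_derived = indicator.reset_index()
```

With the sample data this gives the same flags as the hand-written df1.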
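Now I need a new dataframe in which total_quantity and total_duration are the sums of the quantity and duration values for each ID and keyword pair.
I tried the following code: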
all_keywords = df1.columns.tolist()[1:]
data_list = []
for keyword in all_keywords:
    # Users flagged for this keyword, and their row positions in df1
    ID_selected = df1[df1[keyword] == 1]['USER_ID'].values.tolist()
    indexes = df1[df1[keyword] == 1].index.tolist()
    qty_list = [0] * len(df1)
    duration_list = [0] * len(df1)
    all_qty = []
    all_duration = []
    for id in ID_selected:
        # Sum quantity and duration over the rows of df_main matching this user/keyword pair
        all_qty.append(np.sum(df_main[(df_main['users'] == id) & (df_main['keywords'] == keyword)]['quantity'].values.tolist()))
        all_duration.append(np.sum(df_main[(df_main['users'] == id) & (df_main['keywords'] == keyword)]['duration'].values.tolist()))
    for index, qty, duration in zip(indexes, all_qty, all_duration):
        qty_list[index] = qty
        duration_list[index] = duration
    d_temp = pd.DataFrame(list(zip(qty_list, duration_list)), columns = [keyword+'qty', keyword+'duration'])
    data_list.append(d_temp)
result = pd.concat(data_list)
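The code works, but it is really slow and I would like to get rid of the loops. I would appreciate it if someone could show me a more optimized approach.
Answer 0 (score: 1)
The main performance problem in your code is the nested loops. You can use pandas built-in methods instead, which delegate all of the looping to compiled (NumPy/C) code.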
For example, use groupby for the sums, unstack for the reshaping, flatten the column index, and then merge the result with df1:
df_temp = df_main.groupby(['users', 'keywords']).sum().unstack()
df_temp.columns = 'total_' + df_temp.columns.map('_'.join)  # flatten column index
df1 = df1.merge(df_temp, left_on='USER_ID', right_on='users')
Output (df_temp):
       total_quantity_k1  total_quantity_k2  total_quantity_k3  \
users
id1                 20.0                NaN                NaN
id2                  NaN                4.0                NaN
id3                  NaN                2.0                4.0
id4                  NaN                NaN                4.0

       total_duration_k1  total_duration_k2  total_duration_k3
users
id1                  2.0                NaN                NaN
id2                  NaN                6.0                NaN
id3                  NaN                3.0                2.0
id4                  NaN                NaN                2.0
Output (df1 after the merge):
  USER_ID  k1  k2  k3  total_quantity_k1  total_quantity_k2  \
0     id1   1   0   0               20.0                NaN
1     id2   0   1   0                NaN                4.0
2     id3   0   1   1                NaN                2.0
3     id4   0   0   1                NaN                NaN

   total_quantity_k3  total_duration_k1  total_duration_k2  total_duration_k3
0                NaN                2.0                NaN                NaN
1                NaN                NaN                6.0                NaN
2                4.0                NaN                3.0                2.0
3                4.0                NaN                NaN                2.0
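The loop in the question starts every total at 0, whereas the merged output above contains NaN for pairs that do not occur. If zeros are preferred, one possible variant (a sketch, not part of the original answer) is to pass fill_value=0 to unstack. The names `totals` and `result` are illustrative, and the left merge on the index is an assumption chosen so that every row of df1 is kept.

```python
import pandas as pd

df_main = pd.DataFrame({
    'users':    ['id1', 'id1', 'id2', 'id2', 'id3', 'id3', 'id4'],
    'keywords': ['k1', 'k1', 'k2', 'k2', 'k2', 'k3', 'k3'],
    'quantity': [10, 10, 2, 2, 2, 4, 4],
    'duration': [1, 1, 3, 3, 3, 2, 2],
})
df1 = pd.DataFrame({
    'USER_ID': ['id1', 'id2', 'id3', 'id4'],
    'k1': [1, 0, 0, 0],
    'k2': [0, 1, 1, 0],
    'k3': [0, 0, 1, 1],
})

# Sum per (user, keyword) pair, then pivot the keywords into columns;
# fill_value=0 writes 0 for pairs that never occur instead of NaN.
totals = (
    df_main.groupby(['users', 'keywords'])[['quantity', 'duration']]
    .sum()
    .unstack(fill_value=0)
)

# Flatten the (measure, keyword) column MultiIndex into names like 'total_quantity_k1'.
totals.columns = ['total_' + '_'.join(col) for col in totals.columns]

# Left merge against the users index so that every row of df1 is preserved.
result = df1.merge(totals, left_on='USER_ID', right_index=True, how='left')
```

Because no NaN is introduced for this sample data, the total columns keep their integer dtype. Users that are in df1 but missing from df_main would still come out as NaN after the left merge and would need a separate fillna(0).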