Search optimization

Time: 2020-09-09 15:46:57

Tags: python pandas dataframe

I am working on a problem in which I have two dataframes, df1 and df_main.

df_main is as follows:

import pandas as pd

users = ['id1','id1','id2','id2','id3','id3','id4']
keywords = ['k1','k1', 'k2','k2','k2','k3','k3']
quantity = [10,10,2,2,2,4,4]
duration  = [1,1,3,3,3,2,2]

df_main = pd.DataFrame(list(zip(users, keywords, quantity, duration)), columns = ['users','keywords','quantity','duration'])

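df_main is basically a dataframe containing user IDs, their corresponding keywords, and the quantity and duration columns.

df1 has one column for the user ID, and its remaining columns are all the keywords found in df_main. Using df_main as the reference, each (user ID, keyword) pair that occurs is marked 1; every other cell stays 0.

Here is the code for df1: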

columns = ['USER_ID','k1','k2','k3']
users = ['id1','id2','id3','id4']
values1 = [1,0,0,0]
values2 = [0,1,1,0]
values3 = [0,0,1,1]

df1 = pd.DataFrame(list(zip(users, values1, values2, values3)), columns = columns)
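An indicator table like df1 can also be derived from df_main instead of being typed by hand. The following is a minimal sketch, assuming df_main as defined above; the name df1_derived is introduced here only for illustration. pd.crosstab counts each (user, keyword) pair, and the comparison turns the counts into 0/1 flags.

```python
import pandas as pd

users = ['id1', 'id1', 'id2', 'id2', 'id3', 'id3', 'id4']
keywords = ['k1', 'k1', 'k2', 'k2', 'k2', 'k3', 'k3']
quantity = [10, 10, 2, 2, 2, 4, 4]
duration = [1, 1, 3, 3, 3, 2, 2]
df_main = pd.DataFrame({'users': users, 'keywords': keywords,
                        'quantity': quantity, 'duration': duration})

# Count each (user, keyword) pair, then turn the counts into 0/1 flags
flags = (pd.crosstab(df_main['users'], df_main['keywords']) > 0).astype(int)

# Match df1's layout: a USER_ID column followed by one column per keyword
flags.index.name = 'USER_ID'
flags.columns.name = None
df1_derived = flags.reset_index()  # hypothetical name, not from the question

print(df1_derived)
```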

Now I need the following dataframe:


total_quantity and total_duration are the sums of the quantity and duration values for each (ID, keyword) pair.

I tried the following code:

    import numpy as np

    all_keywords = df1.columns.tolist()[1:]
    data_list = []
    for keyword in all_keywords:
        # Users flagged for this keyword, and their row positions in df1
        ID_selected = df1[df1[keyword] == 1]['USER_ID'].values.tolist()
        indexes = df1[df1[keyword] == 1].index.tolist()

        qty_list = [0] * len(df1)
        duration_list = [0] * len(df1)
        all_qty = []
        all_duration = []
        for user_id in ID_selected:
            # Sum quantity and duration over the matching rows of df_main
            mask = (df_main['users'] == user_id) & (df_main['keywords'] == keyword)
            all_qty.append(np.sum(df_main[mask]['quantity'].values.tolist()))
            all_duration.append(np.sum(df_main[mask]['duration'].values.tolist()))

        for index, qty, duration in zip(indexes, all_qty, all_duration):
            qty_list[index] = qty
            duration_list[index] = duration

        d_temp = pd.DataFrame(list(zip(qty_list, duration_list)), columns = [keyword+'qty', keyword+'duration'])

        data_list.append(d_temp)

    # Place the per-keyword frames side by side
    result = pd.concat(data_list, axis=1)

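The code works, but it is really slow, and I would like to get rid of the loops. I would appreciate it if someone could show me a more efficient approach.

1 answer:

Answer 0: (score: 1)

The main performance problem in your code is the nested loops. You can use pandas built-in methods instead, which hand the looping over to compiled code rather than running it in Python.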
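To illustrate the point, the per-pair filtering and summing done inside the inner loop can be expressed as a single groupby. This is a minimal sketch, assuming the df_main from the question; the output column names total_quantity and total_duration follow the question's wording, and the variable name totals is an assumption made here.

```python
import pandas as pd

df_main = pd.DataFrame({
    'users': ['id1', 'id1', 'id2', 'id2', 'id3', 'id3', 'id4'],
    'keywords': ['k1', 'k1', 'k2', 'k2', 'k2', 'k3', 'k3'],
    'quantity': [10, 10, 2, 2, 2, 4, 4],
    'duration': [1, 1, 3, 3, 3, 2, 2],
})

# One pass over df_main: sum quantity and duration per (user, keyword) pair
totals = (
    df_main.groupby(['users', 'keywords'], as_index=False)
           .agg(total_quantity=('quantity', 'sum'),
                total_duration=('duration', 'sum'))
)

print(totals)
```

The result is in long format, with one row per pair that actually occurs in df_main, so no boolean mask is rebuilt for every user and keyword.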

For the layout asked for in the question, use groupby to sum, unstack to reshape, flatten the column index, and then merge the result with df1:

    df_temp = df_main.groupby(['users', 'keywords']).sum().unstack()
    df_temp.columns = 'total_' + df_temp.columns.map('_'.join)  # flatten column index
    df1 = df1.merge(df_temp, left_on='USER_ID', right_on='users')

Output (df_temp, then the merged df1):

       total_quantity_k1  total_quantity_k2  total_quantity_k3  \
users
id1                 20.0                NaN                NaN
id2                  NaN                4.0                NaN
id3                  NaN                2.0                4.0
id4                  NaN                NaN                4.0

       total_duration_k1  total_duration_k2  total_duration_k3
users
id1                  2.0                NaN                NaN
id2                  NaN                6.0                NaN
id3                  NaN                3.0                2.0
id4                  NaN                NaN                2.0

  USER_ID  k1  k2  k3  total_quantity_k1  total_quantity_k2  \
0     id1   1   0   0               20.0                NaN
1     id2   0   1   0                NaN                4.0
2     id3   0   1   1                NaN                2.0
3     id4   0   0   1                NaN                NaN

   total_quantity_k3  total_duration_k1  total_duration_k2  total_duration_k3
0                NaN                2.0                NaN                NaN
1                NaN                NaN                6.0                NaN
2                4.0                NaN                3.0                2.0
3                4.0                NaN                NaN                2.0
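If zeros are preferred over NaN for pairs that never occur, which is an assumption since the question does not say so explicitly, a variant of the same idea is to build the wide table with pivot_table and fill_value=0. The sketch below assumes the df_main and df1 from the question; the names wide and result are chosen here for illustration.

```python
import pandas as pd

df_main = pd.DataFrame({
    'users': ['id1', 'id1', 'id2', 'id2', 'id3', 'id3', 'id4'],
    'keywords': ['k1', 'k1', 'k2', 'k2', 'k2', 'k3', 'k3'],
    'quantity': [10, 10, 2, 2, 2, 4, 4],
    'duration': [1, 1, 3, 3, 3, 2, 2],
})
df1 = pd.DataFrame({
    'USER_ID': ['id1', 'id2', 'id3', 'id4'],
    'k1': [1, 0, 0, 0],
    'k2': [0, 1, 1, 0],
    'k3': [0, 0, 1, 1],
})

# Sum per (user, keyword) and spread keywords into columns; absent pairs become 0
wide = df_main.pivot_table(index='users', columns='keywords',
                           values=['quantity', 'duration'],
                           aggfunc='sum', fill_value=0)

# Flatten the two-level column index, e.g. ('quantity', 'k1') -> 'total_quantity_k1'
wide.columns = ['total_' + '_'.join(col) for col in wide.columns]

# Left join on the user ID keeps every row of df1, in its original order
result = df1.join(wide, on='USER_ID')

print(result)
```

A user that appears in df1 but not at all in df_main would still get NaN from the join; a trailing fillna(0) would cover that case.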