Question

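import pandas as pd

def normalize_data(data):
    # Wide customer x product matrix; pairs with no purchase become NaN
    df_matrix = pd.pivot_table(data, values='purchase_count', index='customerId', columns='productId')
    # Min-max scale each product column to the range [0, 1]
    df_matrix_norm = (df_matrix - df_matrix.min()) / (df_matrix.max() - df_matrix.min())
    d = df_matrix_norm.reset_index()
    d.index.names = ['scaled_purchase_freq']
    # Back to long format, dropping the NaN pairs
    return pd.melt(d, id_vars=['customerId'], value_name='scaled_purchase_freq').dropna()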

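The code above works correctly, but it is slow, and I get a memory error when I increase the data size. data is a DataFrame with the columns customerId, productId, and purchase_count, giving the number of times each customer purchased each product.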

customerId,productId,purchase_count
21,24186,1
28,25949,1
31,12962,1
31,26246,1
38,26683,1
43,1667,1
50,10831,1
54,47752,1
63,47672,1
64,35108,1
71,48953,1
75,26882,1
77,11777,1
90,32648,1
91,33754,1

Desired output for df_matrix (purchase counts normalized to the range 0 to 1):

    customerId  productId  scaled_purchase_freq
             9          0              0.133333
            25          0              0.133333
            33          0              0.133333
            36          0              0.133333
            44          0              0.133333

The desired output is only an example. I need help finding a more efficient way to normalize the data.
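For reference, the same per-product min-max scaling can probably be written without building the wide customer-by-product matrix at all, by staying in long format and using groupby. The following is only a sketch under some assumptions: the function name normalize_data_long is made up for illustration, the count column is assumed to be named purchase_count, and duplicate (customerId, productId) pairs are averaged to mirror the default aggfunc of pd.pivot_table. Row order will differ from the melt-based version.

```python
import pandas as pd

def normalize_data_long(data):
    # Hypothetical helper: min-max scale purchase_count per product in long format.
    # Average duplicate (customerId, productId) pairs, as pivot_table does by default.
    agg = data.groupby(['customerId', 'productId'], as_index=False)['purchase_count'].mean()

    # Per-product minimum and maximum, broadcast back onto each row.
    per_product = agg.groupby('productId')['purchase_count']
    lo = per_product.transform('min')
    hi = per_product.transform('max')

    # Products where max == min give 0/0 = NaN, which is dropped below,
    # matching what the pivot-based version does after dropna().
    agg['scaled_purchase_freq'] = (agg['purchase_count'] - lo) / (hi - lo)
    return agg.drop(columns='purchase_count').dropna(subset=['scaled_purchase_freq'])
```

Because no cell is ever created for a customer/product pair that does not occur in data, memory use should grow with the number of purchase rows rather than with customers times products.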

Is there a faster, more efficient pandas alternative to pd.pivot_table?

0 answers: