I have a for loop like the one below, but it takes a very long time once I pass a large dataset to it.
for i in range(0, len(data_sim.index)):
    for j in range(1, len(data_sim.columns)):
        user = data_sim.index[i]
        activity = data_sim.columns[j]
        if dt_full.loc[i][j] != 0:
            data_sim.loc[i][j] = 0
        else:
            activity_top_names = data_neighbours.loc[activity][1:dt_length]
            activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
            user_purchases = data_activity.loc[user, activity_top_names]
            data_sim.loc[i][j] = getScore(user_purchases, activity_top_sims)
Inside the for loop, data_sim looks like this:
CustomerId  A    B    C    D    E
1           NAs  NAs  NAs  NAs  NAs
2           ..
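One likely source of the slowness is that the loop redoes work that depends only on the activity (the neighbour lookup and the sort) for every single user, and reads and writes cells through chained indexing. Below is a minimal sketch of the same calculation with that work moved out of the inner loop. It is not a drop-in replacement: `get_score` is a hypothetical stand-in for `getScore` (which the post does not show), `fill_scores` is a name made up for illustration, and it assumes that `dt_full` and `data_activity` are labelled by user and activity and that `data_neighbours` lists neighbours in the same order as the sorted correlations.

```python
import pandas as pd


def get_score(history, similarities):
    # Hypothetical stand-in for getScore: a similarity-weighted average.
    h = history.to_numpy(dtype=float)
    s = similarities.to_numpy(dtype=float)
    return float((h * s).sum() / s.sum())


def fill_scores(data_activity, data_corr, data_neighbours, dt_full, dt_length):
    # Start from zeros so cells for existing purchases need no assignment.
    scores = pd.DataFrame(0.0, index=data_activity.index, columns=data_corr.index)
    for activity in data_corr.index:
        # Depends only on the activity, so compute once per column, not per cell.
        top_names = data_neighbours.loc[activity].iloc[1:dt_length].tolist()
        top_sims = data_corr.loc[activity].sort_values(ascending=False).iloc[1:dt_length]
        for user in data_activity.index:
            if dt_full.at[user, activity] != 0:
                continue  # existing purchase: score stays 0
            purchases = data_activity.loc[user, top_names]
            # .at writes a single cell without chained indexing.
            scores.at[user, activity] = get_score(purchases, top_sims)
    return scores


# Tiny made-up data set, only to show the shapes involved.
names = ["A", "B", "C"]
data_corr = pd.DataFrame(
    [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]], index=names, columns=names
)
data_neighbours = pd.DataFrame(
    [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]], index=names
)
data_activity = pd.DataFrame([[1, 0, 0], [0, 1, 1]], index=[1, 2], columns=names)
scores = fill_scores(data_activity, data_corr, data_neighbours, data_activity, 3)
```

The inner loop is still plain Python, so this only removes the repeated sort and lookup; whether that is enough depends on how many users and activities there are.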
I tried to reproduce the same process with an apply function, as follows:
def test(cell):
    user = cell.index
    activity = cell
    activity_top_names = data_neighbours.loc[activity][1:dt_length]
    activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
    user_purchase = data_activity_index.loc[user, activity_top_names]
    if dt_full.loc[user][activity] != 0:
        return cell.replace(cell, 0)
    else:
        re = getScore(user_purchase, activity_top_sims)
        return cell.replace(cell, re)
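For the function, data_sim2 looks like the following: I set the 'CustomerId' column as the index and copied each activity name into every cell of that activity's column.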
CustomerId(Index)  A  B  C  D  E
1                  A  B  C  D  E
2                  A  B  C  D  E
Inside the function 'def test(cell)', if the cell is the one at data_sim2[1][0], I expect:
cell.index = 1 # userId
cell # activity name
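Whether the function really receives one cell at a time can be checked directly. The sketch below uses a made-up two-column version of data_sim2 and a hypothetical `inspect` function that records what `DataFrame.apply` hands over. With the default `axis=0`, the argument is a whole column as a Series, and its `.index` holds every customer id rather than a single one. The `per_cell` helper is likewise only an illustration of one way to get a true per-cell call with explicit user and activity labels.

```python
import pandas as pd

# Made-up miniature of data_sim2: each column repeats its own activity name.
data_sim2 = pd.DataFrame(
    {"A": ["A", "A"], "B": ["B", "B"]},
    index=pd.Index([1, 2], name="CustomerId"),
)

seen = []


def inspect(arg):
    # Record the type and the index labels of whatever apply passes in.
    seen.append((type(arg).__name__, list(arg.index)))
    return arg


data_sim2.apply(inspect)  # default axis=0: called with one column at a time


def per_cell(frame, func):
    # Call func(user, activity) once per cell and rebuild a frame of the same shape.
    return pd.DataFrame(
        {col: [func(user, col) for user in frame.index] for col in frame.columns},
        index=frame.index,
    )


labels = per_cell(data_sim2, lambda user, activity: f"{user}-{activity}")
```

Here `seen` ends up holding Series entries whose index is the full list of customer ids, which is different from the single-cell picture described above.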
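The whole idea of the for loop is to fill the 'data_sim' table with score data according to each cell's position. I used the same idea when writing the function, doing the same calculation for each cell, and then applied it to the data table 'data_sim':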
data_test = data_sim2.apply(lambda x: test(x))
It gave me an error:
"sort_values() missing 1 required positional argument: 'by'"
This is strange, because the problem does not occur inside the for loop. It sounds like 'data_corr.loc[activity]' is still a DataFrame rather than a Series.
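That suspicion can be reproduced in isolation. In the sketch below (made-up data, not the real tables), `.loc` with a single label returns a Series, while `.loc` with a Series of labels returns a DataFrame. `Series.sort_values` needs no `by`, but `DataFrame.sort_values` does, which matches the error message. This assumes that `activity` inside `test` is a whole column of activity names rather than one string.

```python
import pandas as pd

# Made-up correlation table.
data_corr = pd.DataFrame(
    [[1.0, 0.5], [0.5, 1.0]], index=["A", "B"], columns=["A", "B"]
)

# A single label selects one row and returns a Series.
row = data_corr.loc["A"]
sorted_row = row.sort_values(ascending=False)  # fine: Series.sort_values

# A Series of labels (what a column of data_sim2 looks like) selects
# several rows and returns a DataFrame.
column_of_names = pd.Series(["A", "A"], index=[1, 2])
block = data_corr.loc[column_of_names]

try:
    block.sort_values(ascending=False)  # DataFrame.sort_values requires 'by'
    message = ""
except TypeError as exc:
    message = str(exc)
```

After this runs, `message` contains the complaint about the missing `by` argument, so the error points at what is being passed into `.loc`, not at `sort_values` itself.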