Question

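I'm working with the MovieLens dataset. The goal is to map and transform it into several informative, readable DataFrames before creating a top-10 recommendation list for every user in the dataset. The dataset also contains the next movieId each user went on to watch, which is used to check whether that movie appears in the user's top-10 recommendations.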

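So far I've been able to work things out on my own, or with some help from Stack Overflow and a few tutorials. I wrote a simple function with an if/else statement that is mapped over the RDD and returns 1 if nextmovie appears in my top-10 recommendation list, and 0 otherwise.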

# MOVIE RECOMMENDER

# All movies, ordered by popularity
allMoviesRDD = movieStatsDF.rdd.map(lambda x: x[0]).collect()
# sample: [296, 356, 593, 318, 480, 260, 110, 589, 2571, 527]

# Joined DataFrame that I'm using
+------+---------+--------------------+
|userId|nextmovie|                seen|
+------+---------+--------------------+
|   148|     2629|[2396, 2671, 4306...|
|   471|     4122|[3101, 1645, 2858...|
|   833|      527|[364, 150, 432, 4...|
|  1580|     1196|[7161, 1203, 8610...|
|  1645|       26|[608, 107, 7, 256...|
+------+---------+--------------------+
root
 |-- userId: integer (nullable = true)
 |-- nextmovie: integer (nullable = true)
 |-- seen: array (nullable = true)
 |    |-- element: string (containsNull = true)


# Mapping function
def recommend(record):
    userid, nextmovie, seen = record

    if nextmovie in allMoviesRDD[:10]:
        return 1
    else:            
        return 0

# Call the function
hits = joinDF.rdd.map(recommend)

# Print success percentage
print("Success% =", 100 * hits.sum() / hits.count())

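The above is very crude and basic, but it works and gives a success rate of 3.936%. However, since a user may already have seen one of the movies in the top-10 list, I have to check the top-10 list against the list of movies the user has already seen. If a movieId appears in both lists, I have to remove it from the top-10 recommendations and then add the next-best movie to the list.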

I thought this would be fairly straightforward and changed the function to the following:

def recommend(record):
    userId, nextmovie, seen = record
    recommendations = allMoviesRDD

    if recommendations[0] in seen:
        recommendations.remove(recommendations[0])

    if recommendations[1] in seen:
        recommendations.remove(recommendations[1])

    if recommendations[2] in seen:
        recommendations.remove(recommendations[2])

    if recommendations[3] in seen:
        recommendations.remove(recommendations[3])

    if recommendations[4] in seen:
        recommendations.remove(recommendations[4])

    # etc, etc... (I tried a for loop, but it crashes the kernel)

    if nextmovie in recommendations[:10]:
        return 1
    else:            
        return 0

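But whatever I try, I can't seem to access the "seen" list of each record passed to the function. I've tried iterating over it, or accessing record[2] and casting it... all to no avail. I'm not getting any errors, though, so I assume the "seen" list in the record is empty. But without debugging I really have no clue :(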
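One way to get a clue without a debugger is to pull a single record back to the driver and look at the types involved. The printed schema lists the elements of seen as string, while the sample of allMoviesRDD shows integers, so a test like recommendations[0] in seen may simply never match even when the list is not empty. A minimal sketch of such a check follows; describe_record is a hypothetical helper name, and the sample record is made up for illustration.

```python
def describe_record(record, popular, n=10):
    """Summarise one (userId, nextmovie, seen) record for debugging.

    `popular` is assumed to be a plain Python list of movie ids,
    most popular first (what collect() returns in the question).
    """
    user_id, next_movie, seen = record
    seen = seen or []  # the schema allows the array itself to be null
    top = popular[:n]
    # Assumption: every non-null element of `seen` is a numeric string.
    seen_as_int = {int(m) for m in seen if m is not None}
    return {
        "seen_length": len(seen),
        "seen_element_types": sorted({type(m).__name__ for m in seen}),
        "popular_element_types": sorted({type(m).__name__ for m in top}),
        # How many top-n movies match `seen` as-is (int vs. str comparison).
        "raw_overlap": sum(1 for m in top if m in seen),
        # How many match once the strings are cast to int.
        "overlap_after_int_cast": sum(1 for m in top if m in seen_as_int),
    }


# Made-up record and popularity list, only to show the call shape.
sample_popular = [296, 356, 593, 318, 480, 260, 110, 589, 2571, 527]
sample_record = (833, 527, ["364", "150", "432", "296"])
print(describe_record(sample_record, sample_popular))

# On the real data (assuming joinDF and allMoviesRDD from the question):
# joinDF.rdd.map(lambda r: describe_record(r, allMoviesRDD)).take(1)
```

If raw_overlap stays at 0 while overlap_after_int_cast is positive, the list is being read correctly and the comparison is failing on type alone.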

I've spent hours searching, trying and failing... Can anyone help me understand this damn thing!? :) Many thanks!
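For what it's worth, the step described earlier (drop the movies a user has already seen, then fill up with the next most popular ones) can be sketched without calling remove() at all. Note that recommendations = allMoviesRDD does not copy the list; it binds a second name to the same object, so any remove() would change the shared list for later records too. The sketch below assumes the seen elements are numeric strings, as the schema suggests, and that allMoviesRDD is the plain Python list returned by collect(). make_recommender is a hypothetical name, not part of Spark.

```python
def make_recommender(popular, n=10):
    """Build a mapper for (userId, nextmovie, seen) records.

    `popular` is assumed to be ordered by popularity, most popular first.
    """
    # Private copy as ints; it is only read, never mutated.
    popular = [int(m) for m in popular]

    def recommend(record):
        user_id, next_movie, seen = record
        # Cast the string ids to int once and use a set for fast lookups.
        seen_ids = {int(m) for m in (seen or []) if m is not None}
        top = []
        for movie in popular:
            if movie not in seen_ids:
                top.append(movie)
                if len(top) == n:
                    break
        # 1 if the next movie is among the first n unseen movies, else 0.
        return 1 if next_movie in top else 0

    return recommend


# Usage on the real data (assuming joinDF and allMoviesRDD from the question):
# hits = joinDF.rdd.map(make_recommender(allMoviesRDD))
# print("Success% =", 100 * hits.sum() / hits.count())
```

Building the per-user list from scratch keeps the popularity list untouched for every record, and stopping after n unseen movies avoids walking the full list for most users.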

Python / Spark .map function: from record

0 answers: