How to use aggregateByKey to append and limit the size to 10?

Asked: 2018-04-04 08:55:48

Tags: python apache-spark pyspark

I have an RDD of the form

[playerID, gameID, amount_played]

I want to group it by playerID as the key; I only need at most 50 entries per player ID.

RDD.aggregateByKey(
    0,                       # initial value for an accumulator
    lambda r, v: r + v,      # function that adds a value to an accumulator
    lambda r1, r2: r1 + r2,  # function that merges/combines two accumulators
).take(1)

1 Answer:

Answer 0 (score: 0)

You can use combineByKey:

def appender(a, b):
    # mergeValue: add one movie to an existing partition-local list
    a.append(b)
    return a

def extender(a, b):
    # mergeCombiners: merge two lists built on different partitions
    a.extend(b)
    return a

recommendRDD.combineByKey(
    lambda movieId: [movieId],  # make a list out of the initial value
    appender,                   # appender adds a movie to a pre-created list
    extender,                   # extender combines two pre-created lists
).take(1)
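
As a quick end-to-end illustration, here is a minimal sketch of running this; the SparkContext setup and the (userID, movieId) sample data are assumptions made up for the example, and appender/extender are the functions defined above:

from pyspark import SparkContext

sc = SparkContext("local", "combineByKeyExample")  # hypothetical local setup

# hypothetical (userID, movieId) pairs standing in for recommendRDD
recommendRDD = sc.parallelize([
    (1, "m1"), (1, "m2"), (2, "m3"), (1, "m4"),
])

grouped = recommendRDD.combineByKey(
    lambda movieId: [movieId],  # createCombiner: start a list from the first value
    appender,                   # mergeValue: add one value to a partition-local list
    extender,                   # mergeCombiners: merge lists across partitions
)

print(grouped.collect())  # e.g. [(1, ['m1', 'm2', 'm4']), (2, ['m3'])]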

If you need to limit the number of movies, just add that logic to the appender and extender functions:

def appender(a, b):
    a.append(b)
    return a[:10]  # keep only the first 10 elements

def extender(a, b):
    a.extend(b)
    return a[:10]  # keep only the first 10 elements

But you need to be careful with that limit: a plain slice keeps whatever 10 elements happen to come first, so you may exclude the most highly recommended movies.
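
One way to avoid that (a sketch only, and it assumes the values are (movieId, rating) pairs rather than bare movie IDs, which goes beyond the original answer) is to keep the 10 highest-rated entries at every merge step instead of taking an arbitrary slice, for example with heapq.nlargest:

import heapq

def appender(a, b):
    a.append(b)  # b is assumed to be a (movieId, rating) pair
    return heapq.nlargest(10, a, key=lambda pair: pair[1])  # keep the 10 best by rating

def extender(a, b):
    a.extend(b)
    return heapq.nlargest(10, a, key=lambda pair: pair[1])

Because both merge functions truncate the same way, every intermediate list stays at 10 elements or fewer, and the highest-rated movies are never dropped.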