Apache Spark - how to count identical key/value pairs in a pair RDD

Asked: 2016-08-19 11:42:06

Tags: apache-spark rdd

The rdd is of type RDD[(String, String)].

Input RDD:

val rdd = sc.parallelize(Seq(("java", "perl"),(".Net", "php"),("java","perl")))

(java, perl)
(.Net, php)
(java, perl)

I want an output of type RDD[(String, String, Int)], where the third item in each tuple is the count of identical pairs, e.g.:

Output RDD:

(java, perl, 2)
(.Net, php, 1)

I tried appending a 1 to each record in the input RDD and then reducing by key to get the count:

val t = rdd.map { case (a,b) => (a,b,1) }
(java, perl, 1)
(.Net, php, 1)
(java, perl, 1)

But t.reduceByKey((a,b,c) => (a,b,c)) gives an error:

value reduceByKey is not a member of org.apache.spark.rdd.RDD[(String, String, Int)]
t.reduceByKey((a,b,c) => (a,b,c))

I will also be converting the output RDD to a DataFrame.

1 Answer:

Answer 0 (score: 1)

reduceByKey is only available on RDDs of two-element (key, value) tuples, so the three-element tuples produced by your map don't have it. Instead, you can create a new key by joining the two values and then add the counts up, as shown below:

lines = sc.parallelize(["java, perl", ".Net, php", "java, perl"])
splitted = lines.map(lambda l: l.split(","))
processed = splitted.map(lambda l: (l[0] + "," + l[1], 1))
reduced = processed.reduceByKey(lambda a, b: a+b)

Or simply treat the whole line as the "key":

lines = sc.parallelize(["java, perl", ".Net, php", "java, perl"])
processed = lines.map(lambda l: (l, 1))
reduced = processed.reduceByKey(lambda a, b: a + b)

Output:

>>> lines.collect()
['java, perl', '.Net, php', 'java, perl']
>>> reduced.collect()
[('.Net, php', 1), ('java, perl', 2)]
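
Since the question starts from an RDD[(String, String)] rather than raw lines, the same counting idea can also be applied directly to the pair RDD by using the whole (a, b) tuple as the key. A minimal PySpark sketch of that variant (variable names here are illustrative, not from the original answer):

pairs = sc.parallelize([("java", "perl"), (".Net", "php"), ("java", "perl")])
# use the (a, b) tuple itself as the key, sum the 1s, then flatten to (a, b, count)
counted = (pairs.map(lambda kv: (kv, 1))
                .reduceByKey(lambda a, b: a + b)
                .map(lambda kv: (kv[0][0], kv[0][1], kv[1])))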

Edit:

You can define a function to format the data and apply it with a map transformation:

def formatter(line):
    # line looks like ("java, perl", 2); split the key on "," and strip spaces
    skills = [s.strip() for s in line[0].split(",")]
    return skills[0], skills[1], line[1]

threecols = reduced.map(formatter)
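
To get the DataFrame mentioned in the question, one possible follow-up sketch (assuming a SparkSession is available and using placeholder column names skill1, skill2, cnt):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# threecols contains tuples such as ('java', 'perl', 2)
df = spark.createDataFrame(threecols, ["skill1", "skill2", "cnt"])
df.show()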