I have an RDD of type RDD[(String, String)]:

Input RDD:
val rdd = sc.parallelize(Seq(("java", "perl"), (".Net", "php"), ("java", "perl")))
(java, perl)
(.Net, php)
(java, perl)
I want an output of type RDD[(String, String, Int)], where the third element of each tuple is the count of identical pairs, e.g.:

Output RDD:
(java, perl, 2)
(.Net, php, 1)
I tried appending a 1 to each record of the input RDD and then reducing by key to get the count:
val t = rdd.map { case (a,b) => (a,b,1) }
(java, perl, 1)
(.Net, php, 1)
(java, perl, 1)
But t.reduceByKey((a,b,c) => (a,b,c)) gives the error:
value reduceByKey is not a member of org.apache.spark.rdd.RDD[(String, String, Int)]
t.reduceByKey((a,b,c) => (a,b,c))
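The error occurs because reduceByKey is only defined on pair RDDs (RDD[(K, V)]); a three-element tuple RDD has no key/value structure. A minimal Scala sketch of one way around this is to key on the (String, String) pair itself and flatten afterwards:

// Key on the pair so reduceByKey is available, then flatten back to a triple.
val counted = rdd
  .map { case (a, b) => ((a, b), 1) }
  .reduceByKey(_ + _)
  .map { case ((a, b), n) => (a, b, n) }
// counted: RDD[(String, String, Int)], e.g. (java, perl, 2)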
I am also converting the output RDD to a DataFrame.
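For that conversion, a sketch building on the counted RDD above (it assumes a SparkSession named spark is in scope; the column names are illustrative, not from the original post):

import spark.implicits._

// Column names here are illustrative assumptions.
val df = counted.toDF("first", "second", "count")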
Answer 0 (score: 1)
You can create a new key by joining the two values, and then add up the counts like this:
lines = sc.parallelize(["java, perl", ".Net, php", "java, perl"])
splitted = lines.map(lambda l: l.split(","))
processed = splitted.map(lambda l: (l[0] + "," + l[1], 1))
reduced = processed.reduceByKey(lambda a, b: a+b)
Or simply treat the whole line as the "key":
lines = sc.parallelize(["java, perl", ".Net, php", "java, perl"])
processed = lines.map(lambda l: (l, 1))
reduced = processed.reduceByKey(lambda a, b: a + b)
Output:
>>> lines.collect()
['java, perl', '.Net, php', 'java, perl']
>>> reduced.collect()
[('.Net, php', 1), ('java, perl', 2)]
Edit:
You can define a function to format the data and apply it with a map transformation:
def formatter(line):
    # line is a (key, count) pair; the key looks like "java, perl",
    # so split on ", " to recover the two fields.
    skills = line[0].split(", ")
    return skills[0], skills[1], line[1]

threecols = reduced.map(formatter)
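With the sample data above, threecols.collect() should give something like [('.Net', 'php', 1), ('java', 'perl', 2)] (partition ordering may vary), which matches the RDD[(String, String, Int)] shape asked for in the question.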