Is there a way to count word occurrences per line of an RDD, rather than using map and reduce to count over the complete RDD?
For example, if an RDD[String] contains these two lines:
Let's have some fun.
To have fun you don't need any plans.
then the output should be like a map containing the key-value pairs:
("设' S",1)
("具有",1)
("一些",1)
("乐趣",1)
("至",1)
("具有",1)
("乐趣",1)
("你",1)
("不要' T",1)
(&#34;需要&#34;,1)<登记/>(&#34;计划&#34;,1)
Answer 0 (score: 3)
What you want is to transform a line into a Map(word, count). So you can define a per-line word-count function:
def wordsCount(line: String): Map[String, Int] = {
  // split on spaces, pair each word with 1, then group equal words and count them
  line.split(" ").map(v => (v, 1)).groupBy(_._1).mapValues(_.size)
}
Then apply it to your RDD[String]:
val lines:RDD[String] = ...
val wordsByLineRDD:RDD[Map[String,Int]] = lines.map(wordsCount)
// this should give you a Map per line with count of each word
wordsByLineRDD.take(2)
// Something like
// Array(Map(some -> 1, have -> 1, Let's -> 1, fun. -> 1), Map(any -> 1, have -> 1, don't -> 1, you -> 1, need -> 1, fun -> 1, To -> 1, plans. -> 1))
Answer 1 (score: 3)
If you're just starting out with Spark and no one has told you to use the RDD API, don't. The Spark SQL API is much nicer and usually more efficient for this and many other distributed computations over large datasets in Spark.
Using the RDD API is like using assembler where you could use Scala (or another higher-level programming language). That's certainly too much to think about when starting your journey with Spark, which is why I personally recommend Spark SQL's higher-level API with DataFrames and Datasets from the start.
Given the input:
$ cat input.txt
Let's have some fun.
To have fun you don't need any plans.
and you want to use the Dataset API, you could do the following:
// In spark-shell these are already in scope; in a standalone application you need:
// import org.apache.spark.sql.functions._
// import spark.implicits._
val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line |words |
+-------------------------------------+------+
|Let's have some fun. |Let's |
|Let's have some fun. |have |
|Let's have some fun. |some |
|Let's have some fun. |fun. |
| | |
|To have fun you don't need any plans.|To |
|To have fun you don't need any plans.|have |
|To have fun you don't need any plans.|fun |
|To have fun you don't need any plans.|you |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need |
|To have fun you don't need any plans.|any |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+
scala> wordsPerLine.
groupBy("line", "words").
count.
withColumn("word_count", struct($"words", $"count")).
select("line", "word_count").
groupBy("line").
agg(collect_set("word_count")).
show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line |collect_set(word_count) |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun. |[[have,1], [fun.,1], [Let's,1], [some,1]] |
| |[[,1]] |
+-------------------------------------+------------------------------------------------------------------------------+
Done. Simple, isn't it?
See the functions object (for the explode and struct functions).
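If you would rather end up with a real map column per line instead of an array of (word, count) structs, a minimal sketch (assuming Spark 2.4+ for map_from_entries) could look like this:

// Sketch only: collapse the collected (word, count) structs into a MapType column per line
val wordCountMaps = wordsPerLine.
  groupBy("line", "words").
  count.
  groupBy("line").
  agg(map_from_entries(collect_list(struct($"words", $"count"))) as "word_counts")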
Answer 2 (score: 2)
From my understanding, you can do the following.
You said you have the data as an RDD[String]:
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
You can apply flatMap to split the lines and create (word, 1) tuples in a map function:
val output = rddData.flatMap(_.split(" ")).map(word => (word, 1))
which should give you the desired output:
output.foreach(println)
For occurrences per line, you should do the following:
// renamed so it does not clash with the output defined above
val perLineOutput = rddData
  .map(_.split(" ").map((_, 1)).groupBy(_._1)
    .map { case (group: String, traversable) => traversable.reduce { (a, b) => (a._1, a._2 + b._2) } }
    .toList)
  .flatMap(tuple => tuple)
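A more readable equivalent of the per-line variant above (just a sketch, same result) counts within each line and flattens afterwards:

// Sketch: count words within each line, then flatten the per-line results
val perLineCounts = rddData.flatMap { line =>
  line.split(" ").groupBy(identity).map { case (word, occurrences) => (word, occurrences.length) }
}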
Answer 3 (score: 0)
Assuming you have an rdd like this:
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
Then just apply flatMap first, followed by map:
val res = rddData.flatMap(line => line.split(" ")).map(word => (word,1))
Expected output:
res.take(100)
res4: Array[(String, Int)] = Array((Let's,1), (have,1), (some,1), (fun.,1), (To,1), (have,1), (fun,1), (you,1), (don't,1), (need,1), (any,1), (plans.,1))
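If you instead wanted totals over the whole RDD (the case the question contrasts with), a reduceByKey on top of these pairs would do it, for example:

// Sketch: aggregate the (word, 1) pairs across the entire RDD
val totals = res.reduceByKey(_ + _)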
Answer 4 (score: 0)
Although this is an old question, I was looking for an answer in pySpark and eventually managed it as below.
# cont_ is assumed to be an existing SparkContext
file_ = cont_.parallelize(
    ["shots are shots that are shots with more big shots by big people",
     "people comes in all shapes and sizes, as people are idoits of the idiots",
     "i know what i am writing is nonsense, but i don't care because i am doing this to test my spark program",
     "my spark is a current spark, that spark in my eyes."]
)
(file_
    .map(lambda x: [((x, i), 1) for i in x.split()])  # key every word by (line, word)
    .flatMap(lambda x: x)                              # flatten the per-line lists of pairs
    .reduceByKey(lambda x, y: x + y)                   # count occurrences of each (line, word)
    .sortByKey(False)                                  # sort by (line, word), descending
    .map(lambda x: (x[0][1], x[1]))                    # drop the line, keep (word, count)
    .collect())