How can I count the number of words in each line of a text file using an RDD?

Date: 2017-05-16 07:01:15

Tags: scala apache-spark

Is there a way to count the occurrences of words per line of an RDD, rather than counting over the complete RDD with map and reduce?

For example, if an RDD[String] contains these two lines:

Let's have some fun.

To have fun you don't need any plans.

Then the output should look like maps containing the key-value pairs:

("Let's",1)
("have",1)
("some",1)
("fun",1)

("To",1)
("have",1)
("fun",1)
("you",1)
("don't",1)
("need",1)
("any",1)
("plans",1)

5 Answers:

Answer 0 (score: 3):

What you want is to transform each line into a Map of (word, count). So you can define a word-count function that works on a single line:

def wordsCount(line: String): Map[String, Int] = {
  line.split(" ").map(v => (v, 1)).groupBy(_._1).mapValues(_.size)
}

Then apply it to your RDD[String]:

val lines:RDD[String] = ...
val wordsByLineRDD:RDD[Map[String,Int]] = lines.map(wordsCount)
// this should give you a Map per line with count of each word
wordsByLineRDD.take(2)
// Something like
// Array(Map(some -> 1, have -> 1, Let's -> 1, fun. -> 1), Map(any -> 1, have -> 1, don't -> 1, you -> 1, need -> 1, fun -> 1, To -> 1, plans. -> 1))

Answer 1 (score: 3):

If you are just starting out with Spark and no one has told you to use it, don't use the RDD API. There are far better, and often more efficient, Spark SQL APIs for this and for many other distributed computations over large datasets in Spark.

Using the RDD API is like using assembler where you could use Scala (or another higher-level programming language). It is certainly too much to think about when starting your journey with Spark, which is why I personally recommend Spark SQL's higher-level APIs with DataFrames and Datasets in the first place.

Given the input:

$ cat input.txt
Let's have some fun.

To have fun you don't need any plans.

and assuming you want to use the Dataset API, you could do the following:

// explode and split come from org.apache.spark.sql.functions,
// and the $"..." column syntax from spark.implicits
import org.apache.spark.sql.functions._
import spark.implicits._

val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line                                 |words |
+-------------------------------------+------+
|Let's have some fun.                 |Let's |
|Let's have some fun.                 |have  |
|Let's have some fun.                 |some  |
|Let's have some fun.                 |fun.  |
|                                     |      |
|To have fun you don't need any plans.|To    |
|To have fun you don't need any plans.|have  |
|To have fun you don't need any plans.|fun   |
|To have fun you don't need any plans.|you   |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need  |
|To have fun you don't need any plans.|any   |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+

scala> wordsPerLine.
  groupBy("line", "words").
  count.
  withColumn("word_count", struct($"words", $"count")).
  select("line", "word_count").
  groupBy("line").
  agg(collect_set("word_count")).
  show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line                                 |collect_set(word_count)                                                       |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun.                 |[[have,1], [fun.,1], [Let's,1], [some,1]]                                     |
|                                     |[[,1]]                                                                        |
+-------------------------------------+------------------------------------------------------------------------------+

Done. Easy, isn't it?

See the functions object (for the explode and struct functions).
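
If you would rather end up with an actual Scala Map per line, as in the question, the typed Dataset API lets you do the counting with plain Scala collections. A minimal sketch (mine, not from the original answer), assuming the same lines DataFrame and imports as above:

// Sketch: one Map of word -> count per line via the typed Dataset API.
val wordMapsPerLine = lines.as[String].map { line =>
  val counts = line
    .split("\\s+")
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.length) }
  (line, counts)
}
wordMapsPerLine.show(truncate = false)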

Answer 2 (score: 2):

As I understand it, you can do the following. You said you have the data as an RDD[String]:

val data = Seq("Let's have some fun.",
  "To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)

You can apply flatMap to split the lines and create (word, 1) tuples in a map function:

val output = rddData.flatMap(_.split(" ")).map(word => (word, 1))

That should give you the desired output:

output.foreach(println)
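
One caveat worth adding (not in the original answer): on a cluster, foreach(println) prints on the executors rather than the driver, so you may see nothing locally. To inspect the pairs on the driver, a small sketch:

// Sketch: bring the pairs back to the driver before printing.
// Fine for small results; avoid collect() on large RDDs.
output.collect().foreach(println)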

To count occurrences per line, you should do the following:

val output = rddData
  .map(_.split(" ")
    .map((_, 1))
    .groupBy(_._1)
    .map { case (_, tuples) => tuples.reduce((a, b) => (a._1, a._2 + b._2)) }
    .toList)
  .flatMap(tuple => tuple)
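
Note that in this per-line variant a word shared between lines, such as "have", comes back once per line, each time with its own count of 1: the grouping happens inside each line before the results are flattened, which is exactly the per-line behaviour the question asks for.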

Answer 3 (score: 0):

Suppose you have an RDD like this:

val data = Seq("Let's have some fun.",
  "To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)

Then just apply flatMap first and then map:

val res = rddData.flatMap(line => line.split(" ")).map(word => (word,1))

Expected output:

res.take(100)
res4: Array[(String, Int)] = Array((Let's,1), (have,1), (some,1), (fun.,1), (To,1), (have,1), (fun,1), (you,1), (don't,1), (need,1), (any,1), (plans.,1))
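
Note that this flattens all lines together, one (word, 1) pair per occurrence. If you instead wanted global totals over the whole RDD (the classic word count, not the per-line counts the question asks for), a minimal follow-up sketch:

// Sketch: sum the (word, 1) pairs into totals across the whole RDD.
val totals = res.reduceByKey(_ + _)
totals.collect().foreach(println)  // e.g. (have,2), (some,1), (fun,1), ...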

Answer 4 (score: 0):

Although this is an old question, I was looking for an answer in PySpark and finally managed it as follows.

# cont_ is the SparkContext
file_ = cont_.parallelize(
    ["shots are shots that are shots with more big shots by big people",
     "people comes in all shapes and sizes, as people are idoits of the idiots",
     "i know what i am writing is nonsense, but i don't care because i am doing this to test my spark program",
     "my spark is a current spark, that spark in my eyes."]
)

(file_
 # build ((line, word), 1) pairs so the counts are keyed per line
 .map(lambda x: [((x, i), 1) for i in x.split()])
 # flatten the per-line lists into a single RDD of pairs
 .flatMap(lambda x: x)
 # sum the 1s for each (line, word) key
 .reduceByKey(lambda x, y: x + y)
 # sort by the (line, word) key, descending
 .sortByKey(False)
 # drop the line from the key, keeping (word, count)
 .map(lambda x: (x[0][1], x[1]))
 .collect())