How can I count the number of words in each line of a text file using an RDD?

Date: 2017-05-16 07:01:15

Tags: scala apache-spark

Is there a way to count the occurrences of words per line of an RDD, rather than counting over the complete RDD with map and reduce?

For example, if an RDD[String] contains these two lines:

Let's have some fun.

To have fun you don't need any plans.

Then the output should look like maps containing the key-value pairs:

("Let's",1)
("have",1)
("some",1)
("fun",1)

("To",1)
("have",1)
("fun",1)
("you",1)
("don't",1)
("need",1)
("any",1)
("plans",1)

5 Answers:

Answer 0 (score: 3):

What you want is to transform each line into a Map of (word, count). So you can define a word-count function that works on a single line:

def wordsCount(line: String): Map[String, Int] = {
  line.split(" ").map(v => (v, 1)).groupBy(_._1).mapValues(_.size)
}

Then apply it to your RDD[String]:

val lines:RDD[String] = ...
val wordsByLineRDD:RDD[Map[String,Int]] = lines.map(wordsCount)
// this should give you a Map per line with count of each word
wordsByLineRDD.take(2)
// Something like
// Array(Map(some -> 1, have -> 1, Let's -> 1, fun. -> 1), Map(any -> 1, have -> 1, don't -> 1, you -> 1, need -> 1, fun -> 1, To -> 1, plans. -> 1))

Answer 1 (score: 3):

If you are just starting out with Spark and no one has told you to use it, don't use the RDD API. There are far better, and often more efficient, Spark SQL APIs for this and for many other distributed computations over large datasets in Spark.

Using the RDD API is like using assembler where you could use Scala (or another higher-level programming language). It is certainly too much to think about when starting your journey with Spark, which is why I personally recommend Spark SQL's higher-level APIs with DataFrames and Datasets in the first place.

Given the input:

$ cat input.txt
Let's have some fun.

To have fun you don't need any plans.

and assuming you want to use the Dataset API, you could do the following:

// explode and split come from org.apache.spark.sql.functions,
// and the $"..." column syntax from spark.implicits
import org.apache.spark.sql.functions._
import spark.implicits._

val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line                                 |words |
+-------------------------------------+------+
|Let's have some fun.                 |Let's |
|Let's have some fun.                 |have  |
|Let's have some fun.                 |some  |
|Let's have some fun.                 |fun.  |
|                                     |      |
|To have fun you don't need any plans.|To    |
|To have fun you don't need any plans.|have  |
|To have fun you don't need any plans.|fun   |
|To have fun you don't need any plans.|you   |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need  |
|To have fun you don't need any plans.|any   |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+

scala> wordsPerLine.
  groupBy("line", "words").
  count.
  withColumn("word_count", struct($"words", $"count")).
  select("line", "word_count").
  groupBy("line").
  agg(collect_set("word_count")).
  show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line                                 |collect_set(word_count)                                                       |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun.                 |[[have,1], [fun.,1], [Let's,1], [some,1]]                                     |
|                                     |[[,1]]                                                                        |
+-------------------------------------+------------------------------------------------------------------------------+

Done. Easy, isn't it?

See the functions object (for the explode and struct functions).
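
If you would rather end up with an actual Scala Map per line, as in the question, the typed Dataset API lets you do the counting with plain Scala collections. A minimal sketch (mine, not from the original answer), assuming the same lines DataFrame and imports as above:

// Sketch: one Map of word -> count per line via the typed Dataset API.
val wordMapsPerLine = lines.as[String].map { line =>
  val counts = line
    .split("\\s+")
    .groupBy(identity)
    .map { case (word, occurrences) => (word, occurrences.length) }
  (line, counts)
}
wordMapsPerLine.show(truncate = false)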

Answer 2 (score: 2):

As I understand it, you can do the following. You said you have the data as an RDD[String]:

val data = Seq("Let's have some fun.",
  "To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)

You can apply flatMap to split the lines and create (word, 1) tuples in a map function:

val output = rddData.flatMap(_.split(" ")).map(word => (word, 1))

That should give you the desired output:

output.foreach(println)
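
One caveat worth adding (not in the original answer): on a cluster, foreach(println) prints on the executors rather than the driver, so you may see nothing locally. To inspect the pairs on the driver, a small sketch:

// Sketch: bring the pairs back to the driver before printing.
// Fine for small results; avoid collect() on large RDDs.
output.collect().foreach(println)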

To count occurrences per line, you should do the following:

val output = rddData
  .map(_.split(" ")
    .map((_, 1))
    .groupBy(_._1)
    .map { case (_, tuples) => tuples.reduce((a, b) => (a._1, a._2 + b._2)) }
    .toList)
  .flatMap(tuple => tuple)
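
Note that in this per-line variant a word shared between lines, such as "have", comes back once per line, each time with its own count of 1: the grouping happens inside each line before the results are flattened, which is exactly the per-line behaviour the question asks for.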

Answer 3 (score: 0):

Suppose you have an RDD like this:

val data = Seq("Let's have some fun.",
  "To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)

Then just apply flatMap first and then map:

val res = rddData.flatMap(line => line.split(" ")).map(word => (word,1))

Expected output:

res.take(100)
res4: Array[(String, Int)] = Array((Let's,1), (have,1), (some,1), (fun.,1), (To,1), (have,1), (fun,1), (you,1), (don't,1), (need,1), (any,1), (plans.,1))
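
Note that this flattens all lines together, one (word, 1) pair per occurrence. If you instead wanted global totals over the whole RDD (the classic word count, not the per-line counts the question asks for), a minimal follow-up sketch:

// Sketch: sum the (word, 1) pairs into totals across the whole RDD.
val totals = res.reduceByKey(_ + _)
totals.collect().foreach(println)  // e.g. (have,2), (some,1), (fun,1), ...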

Answer 4 (score: 0):

Although this is an old question, I was looking for an answer in PySpark and finally managed it as follows.

# cont_ is the SparkContext
file_ = cont_.parallelize(
    ["shots are shots that are shots with more big shots by big people",
     "people comes in all shapes and sizes, as people are idoits of the idiots",
     "i know what i am writing is nonsense, but i don't care because i am doing this to test my spark program",
     "my spark is a current spark, that spark in my eyes."]
)

(file_
 # build ((line, word), 1) pairs so the counts are keyed per line
 .map(lambda x: [((x, i), 1) for i in x.split()])
 # flatten the per-line lists into a single RDD of pairs
 .flatMap(lambda x: x)
 # sum the 1s for each (line, word) key
 .reduceByKey(lambda x, y: x + y)
 # sort by the (line, word) key, descending
 .sortByKey(False)
 # drop the line from the key, keeping (word, count)
 .map(lambda x: (x[0][1], x[1]))
 .collect())