Question

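I have a tab-delimited text file. I need to extract the second element and run a word count only on the words that appear in that second element. (I also need to filter out words shorter than 3 characters, and I want to display the words as keys and the counts as values, in descending order of count.)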

I can read the file using

scala> val lines = sc.textFile("MYDIR/myfile").map(_.split("\t"))

scala> lines.take(3)

and I get Array[Array[String]] =

Array(Array(abc, Here is the First Text, en, Thu Sep 26 08:25:42 CDT 2013, null),
      Array(def, and here is the Second text, en, Thu Sep 26 08:27:22 CDT 2013, null),
      Array(ghi, and here is Another text, en, Thu Sep 26 08:50:21 CDT 2013, null))

If I map to get the second element

val wrdStr = lines.map(ar=>ar(1).toLowerCase)

wrdStr.take(3)
Array[String] = Array(here is the first text, and here is the second text, and here is another text)

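I want to do a basic word count, but if I call .flatMap(_.split("\\W+")) and add a 1 for each word, I no longer have an RDD, so the reduce operation fails when I try it. How can I implement the word count once I have mapped to the second element?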

Answer 1

You can do the following

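wrdStr.flatMap(line => line.split("\\W+"))   // split each line into words
    .filter(word => word.length > 2)         // drop words shorter than 3 characters
    .map(word => (word, 1))                  // pair each word with a count of 1
    .reduceByKey(_ + _)                      // sum the counts per word
    .sortBy(x => x._2, ascending = false)    // order by count, descending
    .foreach(println)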

You should get the following output

(text,3)
(here,3)
(and,2)
(the,2)
(second,1)
(another,1)
(first,1)
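
To check these counts without a Spark installation, the same pipeline can be sketched with plain Scala collections. This is only an illustration under stated assumptions: the object name WordCountSketch, the sample rows (copied from the question's three records), the guard against rows with fewer than two fields, and the alphabetical tie-break are not part of the original answer, and groupBy stands in for reduceByKey.

```scala
// Plain Scala collections stand in for the RDD here (an assumption made so
// the sketch runs without Spark). All names below are hypothetical.
object WordCountSketch {
  // Sample rows copied from the three records shown in the question.
  val sample: Seq[String] = Seq(
    "abc\tHere is the First Text\ten\tThu Sep 26 08:25:42 CDT 2013\tnull",
    "def\tand here is the Second text\ten\tThu Sep 26 08:27:22 CDT 2013\tnull",
    "ghi\tand here is Another text\ten\tThu Sep 26 08:50:21 CDT 2013\tnull"
  )

  def countWords(rows: Seq[String]): Seq[(String, Int)] =
    rows
      .map(_.split("\t"))                      // tab-delimited fields
      .filter(_.length > 1)                    // skip rows with no second field
      .map(fields => fields(1).toLowerCase)    // keep only the second element
      .flatMap(_.split("\\W+"))                // split into words
      .filter(_.length > 2)                    // drop words shorter than 3 characters
      .groupBy(identity)                       // collection analogue of reduceByKey
      .map { case (word, occurrences) => (word, occurrences.size) }
      .toSeq
      // Descending count; ties broken alphabetically so the order is stable.
      .sortBy { case (word, count) => (-count, word) }

  def main(args: Array[String]): Unit =
    countWords(sample).foreach(println)
}
```

The counts match the output listed above; only the order among equal counts may differ, because this sketch breaks ties alphabetically. On a real RDD, foreach(println) runs on the executors, so the printed order is not guaranteed to follow the sort outside local mode. Calling collect() or take(n) before printing is one way to see the sorted result on the driver.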

Counting words in a string element of a tab-delimited file
