Spark filtering based on matches between two arrays in RDDs

Time: 2015-08-21 16:16:29

Tags: scala apache-spark

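I have an RDD of words and another RDD that contains strings. If a word in a string matches a word in the list, it should be removed from the string.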

val wordList = sc.textFile("wordList.txt").map(x => x.split(',')).map(x => x(0))

Sample of wordList:

res15: Array[String] = Array(basetting, choosinesses, concavenesses, crabbinesses, cupidinously, falliblenesses, fleecinesses, hackishes, immaterialnesses, impiousnesses)

Then I have my other RDD:

val filterWord = posts.map(x => (x._1, x._2.split(" ").filter(x => x != (wordList))))

Sample of filterWord:

res16: Array[(String, Array[String])] = Array((6,Array(how, sweet, is, it, that, we, have)), (2,Array("")), (2,Array(will, this, question, cause, an, error)), (2,Array("")), (4,Array(how, do, we, create, a, new, tag, in), (7,Array("")), (2,Array(test, after, clr, on)), (2,Array("")), (2,Array(testing, a, long, tag)), (2,Array("")))

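I need filterWord to contain only the words that are not in wordList, but this has no effect: none of the words in wordList are filtered out. If I change it to == instead, it filters out everything.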

1 answer:

Answer 0: (score: 2)

This removes every post that contains any word from wordList. That may or may not be what you want; please clarify your question.

Spark setup:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc = new SparkContext(conf)

Test data:

val jabberwocky = """
Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe.

“Beware the Jabberwock, my son!
      The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
      The frumious Bandersnatch!”

He took his vorpal sword in hand;
      Long time the manxome foe he sought—
So rested he by the Tumtum tree
      And stood awhile in thought.

And, as in uffish thought he stood,
      The Jabberwock, with eyes of flame,
Came whiffling through the tulgey wood,
      And burbled as it came!

One, two! One, two! And through and through
      The vorpal blade went snicker-snack!
He left it dead, and with its head
      He went galumphing back.

“And hast thou slain the Jabberwock?
      Come to my arms, my beamish boy!
O frabjous day! Callooh! Callay!”
      He chortled in his joy.

’Twas brillig, and the slithy toves
      Did gyre and gimble in the wabe:
All mimsy were the borogoves,
      And the mome raths outgrabe
"""
val words = "the and in all were"

Convert the test data into RDDs.

val posts = sc.parallelize(jabberwocky.split('\n')
                                      .filter(_.nonEmpty)
                                      .zipWithIndex
                                      .map (_.swap))

val wordList = sc.parallelize(words.split(' ')).map(x => (x.toLowerCase(), x))

Create a pair RDD with one row for every word in every post. The key is the lowercased word and the value is the original post.

val postsPairs = posts.flatMap
    { case (i, s) => s.split("\\W+").map(w=> (w.toLowerCase(), (i, s))) }
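To make the shape of postsPairs concrete, here is a minimal sketch that applies the same flatMap body to a plain Scala Seq holding a single line of the poem, so it runs without Spark. The object name PostsPairsSketch and the one-element sample are illustrative assumptions, not part of the answer.

```scala
object PostsPairsSketch {
  // One (index, line) entry in the same shape as `posts`; illustrative sample only.
  val sample: Seq[(Int, String)] = Seq((2, "All mimsy were the borogoves,"))

  // Same body as the RDD flatMap: emit one (lowercased word, original post) pair per word.
  val pairs: Seq[(String, (Int, String))] = sample.flatMap {
    case (i, s) => s.split("\\W+").map(w => (w.toLowerCase, (i, s)))
  }

  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```

The five keys here are all, mimsy, were, the and borogoves, each paired with the same (2, line) value. For the indented lines of the poem, split("\\W+") also yields a leading empty string; that is harmless here because an empty key never matches anything in wordList.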

Find all posts that contain one of the excluded words.

  val withExcluded = postsPairs.join(wordList).map(_._2._1)

(A .distinct could be applied here, but there is little point: duplicates do not matter for the next step.)

Remove from the original list every post that contains one of the excluded words, so that only posts without any excluded word remain.

  val res = posts.subtract(withExcluded)

  // (19,      He went galumphing back.)
  // (22,O frabjous day! Callooh! Callay!”)
  // (21,      Come to my arms, my beamish boy!)
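
If the goal is the one described in the question, keeping every post but dropping the individual words that appear in wordList, no join is needed. The filter in the question compares each word (a String) with wordList itself (an RDD[String]); those are never equal, so != keeps every word and == drops every word. Below is a minimal sketch of a per-word membership test, assuming the word list is small enough to be collected into a Set. The names WordFilterSketch and removeListed and the sample data are hypothetical, and plain Scala collections stand in for the RDDs so the snippet runs without Spark.

```scala
object WordFilterSketch {
  // Hypothetical helper: keep only the tokens that are NOT in `excluded`.
  def removeListed(text: String, excluded: Set[String]): Array[String] =
    text.split(" ").filter(w => !excluded.contains(w.toLowerCase))

  // Stand-ins for wordList.collect().toSet and for the (id, text) posts.
  val excluded: Set[String] = Set("the", "and", "in", "all", "were")
  val samplePosts: Seq[(String, String)] = Seq(
    ("6", "how sweet is it that we have"),
    ("2", "All mimsy were the borogoves")
  )

  // Same shape as filterWord in the question: (id, remaining words).
  val filtered: Seq[(String, Array[String])] =
    samplePosts.map { case (id, text) => (id, removeListed(text, excluded)) }

  def main(args: Array[String]): Unit =
    filtered.foreach { case (id, ws) => println(s"$id -> ${ws.mkString(" ")}") }
}
```

On the RDDs from the question, this would look roughly like val excludedBc = sc.broadcast(wordList.collect().toSet) followed by posts.map(x => (x._1, WordFilterSketch.removeListed(x._2, excludedBc.value))). That assumes posts is an RDD of (id, text) pairs, as the question's own map suggests.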