Spark: filter broadcast variables out of an RDD

Date: 2017-08-09 05:46:23

Tags: apache-spark

I am learning about broadcast variables and trying to filter the broadcast words out of an RDD, but it is not working for me.

Here is my sample data:

content.txt

Hello this is Rogers.com
This is Bell.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce

remove.txt

Hello, is, this, the

Script:

scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))

scala> val remove = sc.textFile("FilterCount/Remove.txt")
scala> val removeRDD = remove.flatMap(x => x.split(",")).map(w => w.trim)

scala> val bRemove = sc.broadcast(removeRDD.collect().toList)

scala> val filtered = contentRDD.filter{case (word) => !bRemove.value.contains(word)}

scala> filtered.foreach(print)
  

Hello this is Rogers.comThis is Bell.comApache Spark TrainingThis   Spark Learning SessionSpark is faster than MapReduce

As shown above, the filtered output still contains the broadcast-variable words. How can I remove them?

1 answer:

Answer 0 (score: 1)

This happens because you are splitting the file on ",", but your file is delimited by spaces (" "). Splitting on "," leaves each whole line as a single element, so no individual word ever matches the broadcast list.

scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))

Replace it with:

scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(" "))
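The effect of the fix can be sketched with plain Scala collections, no SparkContext required; the sample lines and stop words below mirror Content.txt and Remove.txt from the question:

```scala
// Plain-Scala sketch of the corrected pipeline (flatMap + split on " ").
val lines = List(
  "Hello this is Rogers.com",
  "This is Bell.com",
  "Apache Spark Training"
)
// Remove.txt is comma-separated, so "," is the right delimiter there.
val stopWords = "Hello, is, this, the".split(",").map(_.trim).toList

// Splitting each line on " " yields individual words, so contains() can match them.
val words    = lines.flatMap(_.split(" "))
val filtered = words.filter(word => !stopWords.contains(word))
// "Hello", "is", "this" are removed; "This" survives because matching is case-sensitive
```

Note that "This" still gets through here, which is why the case-insensitive variant below is needed.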

Use this to ignore case:

val filtered = contentRDD.filter { word =>
  !bRemove.value.map(_.toLowerCase).contains(word.toLowerCase)
}
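One refinement worth considering: the filter above re-lowercases the whole broadcast list for every word. A sketch of the same case-insensitive logic, shown here with plain Scala collections rather than a live RDD, lowercases the stop words once and uses a Set for constant-time lookups:

```scala
// Lowercase the stop-word list once, up front, instead of inside the filter.
val stopWords = List("Hello", "is", "this", "the")
val stopLower = stopWords.map(_.toLowerCase).toSet   // computed once, not per word

val words    = List("This", "IS", "Spark", "the")
val filtered = words.filterNot(w => stopLower.contains(w.toLowerCase))
// filtered == List("Spark")
```

In Spark the same idea applies: broadcast the already-lowercased Set, so each executor does the conversion zero times per element instead of once per element.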

Hope this works!