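I am learning about broadcast variables and trying to filter the words held in a broadcast variable out of an RDD, but the filter has no effect.
Here is my sample data: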
content.txt
Hello this is Rogers.com
This is Bell.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
remove.txt
Hello, is, this, the
Script:
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))
scala> val remove = sc.textFile("FilterCount/Remove.txt")
scala> val removeRDD = remove.flatMap(x => x.split(",")).map(w => w.trim)
scala> val bRemove = sc.broadcast(removeRDD.collect().toList)
scala> val filtered = contentRDD.filter{case (word) => !bRemove.value.contains(word)}
scala> filtered.foreach(print)
Hello this is Rogers.comThis is Bell.comApache Spark TrainingThis is Spark Learning SessionSpark is faster than MapReduce
As shown above, the filtered output still contains the words from the broadcast variable. How can I remove them?
Answer 0 (score: 1)
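This is because you are splitting the content file on "," while its words are separated by spaces " ". The lines contain no commas, so each line stays a single element and never equals any word in the broadcast list.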
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))
Replace it with:
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(" "))
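The difference between the two delimiters can be checked without Spark. The following is a minimal sketch in plain Scala, meant to be pasted into the Scala REPL or spark-shell; it uses one line from content.txt, and the names `line`, `byComma` and `bySpace` are only illustrative:

```scala
// One line from content.txt, as shown in the question.
val line = "Hello this is Rogers.com"

// The line contains no comma, so the whole line comes back as one element.
val byComma = line.split(",").toList

// Splitting on a space yields the individual words.
val bySpace = line.split(" ").toList

println(byComma) // List(Hello this is Rogers.com)
println(bySpace) // List(Hello, this, is, Rogers.com)
```

With the comma split, the filter compares the entire line against each word in the broadcast list, so nothing is ever removed.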
To ignore case, use this:
val filtered = contentRDD.filter { word =>
  !bRemove.value.map(_.toLowerCase).contains(word.toLowerCase)
}
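The filter above lowercases the whole broadcast list again for every word. One possible alternative, sketched here under stated assumptions, is to lowercase the remove list once and keep it in a Set. Plain Scala collections stand in for the RDDs and the broadcast variable so that the snippet runs without a cluster; `contentLines`, `removeLine` and `removeSet` are illustrative names, not part of the question:

```scala
// Stand-ins for content.txt and remove.txt (plain collections, not RDDs).
val contentLines = List(
  "Hello this is Rogers.com",
  "This is Bell.com",
  "Apache Spark Training",
  "This is Spark Learning Session",
  "Spark is faster than MapReduce"
)
val removeLine = "Hello, is, this, the"

// Build the lookup set once: split on commas, trim, lowercase.
val removeSet: Set[String] =
  removeLine.split(",").map(_.trim.toLowerCase).toSet

// Split the content on spaces and drop any word found in the set,
// comparing in lowercase so that "This" and "this" both match.
val filtered = contentLines
  .flatMap(_.split(" "))
  .filterNot(word => removeSet.contains(word.toLowerCase))

println(filtered.mkString(" "))
// Rogers.com Bell.com Apache Spark Training Spark Learning Session Spark faster than MapReduce
```

In spark-shell the same idea would presumably become `val bRemove = sc.broadcast(removeSet)` followed by `contentRDD.filter(word => !bRemove.value.contains(word.toLowerCase))`.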
Hope this works!