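I'm writing a small program in Spark using Scala and have run into a problem. I have a List/RDD of single-word strings and a List/RDD of sentences, which may or may not contain words from the single-word list. For example: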
val singles = Array("this", "is")
val sentence = Array("this Date", "is there something", "where are something", "this is a string")
I want to select the sentences that contain one or more of the single words, so the result should look like this:
output[(this, Array(this Date, this is a string)), (is, Array(is there something, this is a string))]
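I thought of two approaches. One is to split each sentence and filter with .contains. The other is to split the sentences into an RDD of words and use .join to intersect the two RDDs. I'm looking at roughly 50 words and 5 million sentences. Which approach would be faster? Are there other solutions? Could you also help me with the code? My code doesn't seem to produce any results (although it compiles and runs without errors).
Answer 0 (score: 5)
You can create a set of the required keys, look up the keys in each sentence, and group by key.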
val singles = Array("this", "is")
val sentences = Array("this Date",
                      "is there something",
                      "where are something",
                      "this is a string")
val rdd = sc.parallelize(sentences)  // create RDD
val keys = singles.toSet             // words required as keys
val result = rdd.flatMap { sen =>
    val words = sen.split(" ").toSet
    val common = keys & words        // intersect
    common.map(x => (x, sen))        // map as key -> sen
  }
  .groupByKey.mapValues(_.toArray)   // group values for a key
  .collect                           // get rdd contents as array
// result:
// Array((this, Array(this Date, this is a string)),
// (is, Array(is there something, this is a string)))
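If you want to check the grouping logic before running it on 5 million sentences, the same flatMap-then-group steps can be tried on ordinary Scala collections. This is only a local sketch, not Spark code: `LocalCheck` and `sentencesByWord` are names made up for illustration, and `groupBy` here stands in for `groupByKey`.

```scala
object LocalCheck {
  // Hypothetical helper: mirrors the flatMap + group-by-key steps on local collections.
  def sentencesByWord(singles: Seq[String], sentences: Seq[String]): Map[String, Seq[String]] = {
    val keys = singles.toSet                        // words required as keys
    sentences
      .flatMap { sen =>
        val words = sen.split(" ").toSet            // tokens of this sentence
        (keys & words).toSeq.map(k => (k, sen))     // one (key, sentence) pair per matching key
      }
      .groupBy(_._1)                                // local stand-in for groupByKey
      .map { case (k, pairs) => k -> pairs.map(_._2) }
  }

  def main(args: Array[String]): Unit = {
    val singles = Seq("this", "is")
    val sentences = Seq("this Date", "is there something",
                        "where are something", "this is a string")
    sentencesByWord(singles, sentences).foreach(println)
  }
}
```

Note that the match is on whole, case-sensitive tokens split on a single space, so "Is" or "this," would not match "is" or "this".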
Answer 1 (score: 1)
I just tried to solve your problem and ended up with this code:
// Returns true if the word s appears among the tokens in l.
def check(s: String, l: Array[String]): Boolean = l.contains(s)
val singles = sc.parallelize(Array("this", "is"))
val sentence = sc.parallelize(Array("this Date", "is there something", "where are something", "this is a string"))
val result = singles.cartesian(sentence)           // every (word, sentence) pair
  .filter(x => check(x._1, x._2.split(" ")))       // keep pairs where the sentence contains the word
  .groupByKey()
  .map(x => (x._1, x._2.mkString(", ")))           // pay attention here(*)
result.foreach(println)
The last map line (*) is there only because without it I got CompactBuffer output, like this:
(is,CompactBuffer(is there something, this is a string))
(this,CompactBuffer(this Date, this is a string))
With that map line (using mkString), I get more readable output:
(is,is there something, this is a string)
(this,this Date, this is a string)
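One thing to watch with mkString(", ") is that sentence boundaries become ambiguous if a sentence itself contains a comma. A small plain-Scala sketch of the alternatives, where `RenderGroups` and `render` are hypothetical names and a local Seq stands in for the grouped Iterable:

```scala
object RenderGroups {
  // Hypothetical helper: joins grouped sentences with a caller-chosen separator.
  def render(values: Iterable[String], sep: String): String = values.mkString(sep)

  def main(args: Array[String]): Unit = {
    // Local stand-in for the Iterable[String] that groupByKey produces per key.
    val grouped: Iterable[String] = Seq("is there something", "this is a string")

    println(("is", render(grouped, ", ")))   // comma-separated, as in the answer
    println(("is", render(grouped, " | ")))  // a separator less likely to occur in a sentence
    println(("is", grouped.toList))          // keeps the values as a collection
  }
}
```

Converting with toList or toArray keeps the sentences as separate elements, which is probably what you want if the result is processed further rather than just printed.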
Hope this helps in some way.
FF