I basically have two files, input1 and input2 (both are List[String]). I want to check whether they are subsets of each other or identical. So I have the following:
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("check identical")
val sc = new SparkContext(conf)
// inputFileL and inputFileM are placeholders for the two input file paths
val input1 = sc.textFile(inputFileL)
val input2 = sc.textFile(inputFileM)
// split up words
val words1 = input1.flatMap(line => line.split(" "))
val words2 = input2.flatMap(line => line.split(" "))
// Transform into word and count
val counts1 = words1.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
val counts2 = words2.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
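With the above, I make sure the word counts are the same. Now how do I compare for subsets? Is there a simple way?
Answer 0: (score: 0)
This should do the trick: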
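// Assumes words1 and words2 are local collections such as List[String], not RDDs
val notInWords1 = words2.filterNot(w => words1.contains(w)) // words of words2 missing from words1
val notInWords2 = words1.filterNot(w => words2.contains(w)) // words of words1 missing from words2
// Equal as sets of words: order and repetition are ignored
val bothAreEqual = notInWords1.isEmpty && notInWords2.isEmpty
// The list that is a proper subset of the other, if there is one
val subset =
  if (notInWords1.isEmpty && notInWords2.nonEmpty) Some(words2)
  else if (notInWords2.isEmpty && notInWords1.nonEmpty) Some(words1)
  else None
val oneIsASubsetOfTheOther = subset.isDefined
val words1IsSubsetOfWords2 = subset.contains(words1)
val words2IsSubsetOfWords1 = subset.contains(words2)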
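The same checks can also be phrased with sets. The following is only a minimal sketch, assuming both word lists fit in memory as List[String]; the object name SubsetCheck and its helpers are hypothetical and come from neither the question nor this answer. Because a Set drops repetition, a count map is included for the case where the number of occurrences matters.

```scala
object SubsetCheck {
  // Hypothetical helper: true when every distinct word of `a` also occurs in `b`.
  def isSubset(a: List[String], b: List[String]): Boolean =
    a.toSet.subsetOf(b.toSet)

  // Hypothetical helper: same distinct words, ignoring order and repetition.
  def sameWords(a: List[String], b: List[String]): Boolean =
    a.toSet == b.toSet

  // Hypothetical helper: word -> number of occurrences, for comparisons
  // that must also respect how often each word appears.
  def wordCounts(words: List[String]): Map[String, Int] =
    words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
}
```

Comparing `SubsetCheck.wordCounts(words1) == SubsetCheck.wordCounts(words2)` treats the lists as equal only when every word appears the same number of times in both, which is closer to what the word counts in the question are checking.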
Answer 1: (score: 0)
How about this?
// true when every element of words1 also occurs in words2 (foreach would discard the result)
words1.forall(words2.contains)
Answer 2: (score: 0)
Programmed in Scala:
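// Checks whether l2 occurs in l1 as a contiguous run of elements
def hasSequence[A](l1: List[A], l2: List[A]): Boolean = {
  // start: where the current attempt began in l1; rest1/rest2: what is left to compare
  @annotation.tailrec
  def matchSeq(start: List[A], rest1: List[A], rest2: List[A]): Boolean = {
    (rest1, rest2) match {
      case (_, Nil) => true // all of l2 has been matched
      case (Nil, _) => false // l1 ran out first, so no later start can match either
      case (h1 :: t1, h2 :: t2) if h1 == h2 => matchSeq(start, t1, t2)
      case _ => matchSeq(start.tail, start.tail, l2) // mismatch: retry one element after start
    }
  }
  // An empty l2 is deliberately reported as not contained
  if (l2.isEmpty) false else matchSeq(l1, l1, l2)
}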
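For comparison, the Scala standard library already provides `containsSlice`, which tests whether one sequence occurs inside another as a contiguous run. The sketch below wraps it under the hypothetical names SliceCheck and hasSlice, which are not part of the answer. One difference to be aware of: `containsSlice` treats an empty l2 as contained, whereas hasSequence returns false in that case.

```scala
object SliceCheck {
  // Hypothetical wrapper around the standard library's containsSlice:
  // true when l2 appears in l1 as a contiguous run of elements.
  def hasSlice[A](l1: List[A], l2: List[A]): Boolean =
    l1.containsSlice(l2)
}
```

Note that this, like hasSequence, is an ordered, contiguous check; it answers a different question from the subset comparison, where order and position do not matter.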