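How to check whether a List[String] is a subset of another List[String] in Scala

Time: 2016-12-13 00:42:53

Tags: scala word-count

I basically have two files, input1 and input2 (both List[String]). I want to check whether one is a subset of the other, or whether they are identical. So I have the following: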

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("check identical")
val sc = new SparkContext(conf)
// inputFileL and inputFileM are the paths of the two input files, defined elsewhere
val input1 = sc.textFile(inputFileL)
val input2 = sc.textFile(inputFileM)

// split each line into words
val words1 = input1.flatMap(line => line.split(" "))
val words2 = input2.flatMap(line => line.split(" "))

// transform into (word, count) pairs
val counts1 = words1.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
val counts2 = words2.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }

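With the above I can make sure the word counts are the same. Now how do I compare for a subset? Is there a simple way?

3 answers:

Answer 0: (score: 0)

This should do the trick: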

// Assumes words1 and words2 are local collections (e.g. List[String]), not RDDs
val notInWords1 = words2.filterNot(w => words1.contains(w))
val notInWords2 = words1.filterNot(w => words2.contains(w))

// Equal as sets: neither list has a word the other lacks
val bothAreEqual = notInWords1.isEmpty && notInWords2.isEmpty

// Some(list) holds the proper subset; None if neither is a proper subset of the other
val subset =
  if (notInWords1.isEmpty && notInWords2.nonEmpty) Some(words2)
  else if (notInWords2.isEmpty && notInWords1.nonEmpty) Some(words1)
  else None

val oneIsASubsetOfTheOther = subset.isDefined
val words1IsSubsetOfWords2 = subset.contains(words1)
val words2IsSubsetOfWords1 = subset.contains(words2)
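
If duplicates and order do not matter, the same check can probably be expressed more directly with sets. A minimal sketch, assuming the words are held in local collections rather than RDDs; `isSubset` is a hypothetical helper name, not something from the question:

```scala
// Hypothetical helper: true if every distinct element of xs also occurs in ys.
// Converting to Set discards duplicates and ordering.
def isSubset[A](xs: Seq[A], ys: Seq[A]): Boolean =
  xs.toSet.subsetOf(ys.toSet)
```

With this, `isSubset(words1, words2) && isSubset(words2, words1)` corresponds to the two lists being equal as sets.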

Answer 1: (score: 0)

How about this?

// forall returns true only if every word of words1 is also in words2
words1.forall(words2.contains)
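
Note that a `contains`-based check ignores how often each word occurs. Since the question is about word counts, a count-aware variant may be closer to what is wanted. A minimal sketch, assuming local lists; `isSubMultiset` is a hypothetical helper name:

```scala
// Hypothetical helper: true if every element of xs occurs in ys
// at least as many times as it occurs in xs.
// List.diff removes one matching occurrence from xs per element of ys.
def isSubMultiset[A](xs: List[A], ys: List[A]): Boolean =
  xs.diff(ys).isEmpty
```

For example, `isSubMultiset(List("a", "a"), List("a", "b"))` is false, because "a" appears twice in the first list but only once in the second, whereas a plain `contains` check would accept it.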

Answer 2: (score: 0)

Using plain Scala:

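// Checks whether l2 occurs inside l1 as a contiguous run (order matters).
// Returns false for an empty l2.
def hasSequence[A](l1: List[A], l2: List[A]): Boolean = {
  // start: where the current match attempt began in l1
  // rest / pattern: what is still left to compare in l1 and l2
  def matchSeq(start: List[A], rest: List[A], pattern: List[A]): Boolean =
    (rest, pattern) match {
      case (_, Nil) => true   // all of l2 matched
      case (Nil, _) => false  // l1 exhausted before l2 was fully matched
      case (h1 :: t1, h2 :: t2) if h1 == h2 => matchSeq(start, t1, t2)
      // mismatch: retry one element after the start of the failed attempt
      case _ => matchSeq(start.tail, start.tail, l2)
    }
  if (l2.isEmpty) false else matchSeq(l1, l1, l2)
}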
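
The function above looks for `l2` as a contiguous run inside `l1`, which is a sublist check rather than a set-style subset check. For that case the standard library already offers `containsSlice`. A brief sketch; `hasSlice` is a hypothetical wrapper that keeps the convention above of returning false for an empty `l2`:

```scala
// Hypothetical wrapper around Seq.containsSlice.
// On its own, containsSlice returns true for an empty slice,
// so the nonEmpty guard preserves the "empty l2 => false" convention.
def hasSlice[A](l1: List[A], l2: List[A]): Boolean =
  l2.nonEmpty && l1.containsSlice(l2)
```

For example, `hasSlice(List("a", "a", "b"), List("a", "b"))` is true, while `hasSlice(List("a", "b", "c"), List("a", "c"))` is false because "a" and "c" are not adjacent in the first list.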