Question

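I basically have two files, input1 and input2 (both List[String]). I want to check whether one is a subset of the other, or whether they are identical. So far I have the following: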

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("check identical")
val sc = new SparkContext(conf)
// inputFileL and inputFileM hold the paths of the two input files
val input1 = sc.textFile(inputFileL)
val input2 = sc.textFile(inputFileM)

// split each line into words
val words1 = input1.flatMap(line => line.split(" "))
val words2 = input2.flatMap(line => line.split(" "))

// transform into (word, count) pairs
val counts1 = words1.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
val counts2 = words2.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }

With the above I can make sure the word counts are the same. Now, how do I check for a subset? Is there a simple way?

Answer 1

This should do the trick:

// Assumes words1 and words2 are in-memory List[String] values (RDDs have no filterNot or contains).
val notInWords1 = words2.filterNot(w => words1.contains(w))
val notInWords2 = words1.filterNot(w => words2.contains(w))

val bothAreEqual = notInWords1.isEmpty && notInWords2.isEmpty

// Holds whichever list is a proper subset of the other, if either is.
val subset: Option[List[String]] =
  if (notInWords1.isEmpty && notInWords2.nonEmpty) Some(words2)
  else if (notInWords2.isEmpty && notInWords1.nonEmpty) Some(words1)
  else None

val oneIsASubsetOfTheOther = subset.isDefined
val words1IsSubsetOfWords2 = subset.contains(words1)
val words2IsSubsetOfWords1 = subset.contains(words2)
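
For plain in-memory collections, the same checks can probably be written more directly with sets. The following is a minimal sketch, assuming both inputs fit in memory as lists and that duplicates and ordering do not matter; the `SubsetCheck` object and its method names are made up for illustration.

```scala
object SubsetCheck {
  // True when every distinct element of `xs` also occurs in `ys`.
  // Converting to Set drops duplicates and gives fast membership tests.
  def isSubset[A](xs: List[A], ys: List[A]): Boolean =
    xs.toSet.subsetOf(ys.toSet)

  // True when both lists contain the same distinct elements.
  def sameDistinctElements[A](xs: List[A], ys: List[A]): Boolean =
    xs.toSet == ys.toSet
}
```

With this, `SubsetCheck.isSubset(words1, words2)` and `SubsetCheck.isSubset(words2, words1)` cover both directions, and `sameDistinctElements` covers the "identical" case.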

Answer 2

How about this?

words1.forall(words2.contains)
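
A membership test like the one above ignores how often a word occurs. Since the question builds per-word counts, a comparison that respects duplicates may be closer to what is wanted. This is only a sketch for in-memory lists; the `CountedSubset` object and its method names are hypothetical.

```scala
object CountedSubset {
  // Count how often each element occurs in the list.
  def counts[A](xs: List[A]): Map[A, Int] =
    xs.groupBy(identity).map { case (k, v) => (k, v.size) }

  // True when every element of `xs` occurs in `ys` at least as often.
  def isSubMultiset[A](xs: List[A], ys: List[A]): Boolean = {
    val ysCounts = counts(ys)
    counts(xs).forall { case (word, n) => ysCounts.getOrElse(word, 0) >= n }
  }
}
```

Here `forall` is what yields the Boolean answer: it is true only when the predicate holds for every entry, whereas `foreach` returns `Unit`.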

Answer 3

Using plain Scala:

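// Checks whether l2 occurs in l1 as a contiguous run (a sublist, which is stricter than a subset).
def hasSequence[A](l1: List[A], l2: List[A]): Boolean = {
    // start: where the current attempt began in l1; rest1/rest2: what is left to compare.
    @annotation.tailrec
    def matchSeq(start: List[A], rest1: List[A], rest2: List[A]): Boolean = {
        (rest1, rest2) match {
           case (_, Nil) => true   // all of l2 has been matched
           case (Nil, _) => false  // l1 ran out first; later starting points are even shorter
           case (h1 :: t1, h2 :: t2) if h1 == h2 => matchSeq(start, t1, t2)
           case _ => matchSeq(start.tail, start.tail, l2) // mismatch: retry one element further along
        }
     }
  if (l2.isEmpty) false else matchSeq(l1, l1, l2)
}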
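
For comparison, the standard collections already provide `containsSlice`, which performs the same kind of contiguous-run test. The sketch below wraps it so that an empty `l2` yields `false`, matching the function above; the `SliceCheck` object is a hypothetical name. Note that order and adjacency matter here, unlike in a subset check.

```scala
object SliceCheck {
  // `containsSlice` tests whether `l2` appears in `l1` as a contiguous run, in order.
  // The `nonEmpty` guard keeps the convention that an empty list does not count.
  def hasSequence[A](l1: List[A], l2: List[A]): Boolean =
    l2.nonEmpty && l1.containsSlice(l2)
}
```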

How to check whether a List[String] is a subset of another List[String] in Scala
