Question

One way is this:

list.distinct.size != list.size

Is there a better way? It would be nice to have a containsDuplicates method.

Answer 1

You can also write:

list.toSet.size != list.size

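But the result will be the same, because distinct is already implemented with a Set. In both cases the time complexity should be O(n): you have to traverse the whole list, and each insertion into a hash-based Set is effectively O(1).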

Answer 2

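Assuming "better" means "faster", see the alternative approaches benchmarked in this question, which seems to show some quicker methods (although note that distinct uses a HashSet and is already O(n)). YMMV of course, depending on the specific test case, Scala version, etc. Probably any significant improvement over the "distinct.size" approach would come from an early-out as soon as a duplicate is found, but how much of a speed-up you actually get depends strongly on how common duplicates are in your use case.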
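The early-out idea can be sketched roughly as follows. This is only an illustration, not code from the linked benchmark: the names EarlyExit and hasDuplicate are hypothetical, and the sketch relies on mutable.HashSet.add returning false when the element is already in the set.

```scala
import scala.collection.mutable

object EarlyExit {
  // Hypothetical helper: returns true as soon as a repeated element is seen.
  def hasDuplicate[A](xs: Iterable[A]): Boolean = {
    val seen = mutable.HashSet.empty[A]
    // add returns false if the element was already present, and
    // exists stops at the first element for which that happens.
    xs.exists(x => !seen.add(x))
  }
}
```

With this, EarlyExit.hasDuplicate(List(1, 2, 2, 3)) is true and stops after the second 2. A list without duplicates is still traversed in full.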

If by "better" you mean that you want to write list.containsDuplicates instead of containsDuplicates(list), use an implicit:

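// An implicit class avoids the reflective call that a structural type (new { ... }) needs.
// It must be defined inside an object, class, or trait (or in the REPL).
implicit class EnhanceWithContainsDuplicates[T](s: List[T]) {
  def containsDuplicates: Boolean = s.distinct.size != s.size
}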

assert(List(1,2,2,3).containsDuplicates)
assert(!List("a","b","c").containsDuplicates)

Answer 3

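I think this will stop as soon as a duplicate is found, and it is probably more efficient than doing distinct.size, since I assume distinct keeps a set as well: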

@annotation.tailrec
def containsDups[A](list: List[A], seen: Set[A] = Set[A]()): Boolean = 
  list match {
    case x :: xs => if (seen.contains(x)) true else containsDups(xs, seen + x)
    case _ => false
  }

containsDups(List(1,1,2,3))
// Boolean = true

containsDups(List(1,2,3))
// Boolean = false

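I realize you asked for easy, and I am not sure this version qualifies, but finding a duplicate is the same as finding whether there is an element that has been seen before: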

def containsDups[A](list: List[A]): Boolean =  {
  list.iterator.scanLeft(Set[A]())((set, a) => set + a) // incremental sets
    .zip(list.iterator)
    .exists{ case (set, a) => set contains a }
}

Answer 4

@annotation.tailrec 
def containsDuplicates [T] (s: Seq[T]) : Boolean = 
  if (s.size < 2) false else 
    s.tail.contains (s.head) || containsDuplicates (s.tail)

I have not measured this, and I think it is similar to huynhjl's solution, but a bit simpler to understand.

It returns early if a duplicate is found, so I looked into the source of Seq.contains to check whether that returns early too, and it does.

In SeqLike, 'contains (e)' is defined as 'exists (_ == e)', and exists is defined in TraversableLike:

def exists (p: A => Boolean): Boolean = {
  var result = false
  breakable {
    for (x <- this)
      if (p (x)) { result = true; break }
  }
  result
}
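The early return is also easy to observe without reading the library source. A minimal sketch, where the names ExistsShortCircuit and probe are made up for this illustration, counts how many elements the predicate is applied to:

```scala
object ExistsShortCircuit {
  // Hypothetical helper: returns the result of exists together with
  // the number of elements the predicate was actually evaluated on.
  def probe[A](xs: Seq[A], target: A): (Boolean, Int) = {
    var calls = 0
    val found = xs.exists { x => calls += 1; x == target }
    (found, calls)
  }
}
```

ExistsShortCircuit.probe(List(1, 2, 3, 4, 5), 2) gives (true, 2), so the remaining three elements are never tested. When nothing matches, every element is tested.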

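I am curious how this could be sped up with parallel collections on multiple cores, but I guess early return is a general problem there: another thread will keep running because it does not know that the solution has already been found.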

Answer 5

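Summary: I have written a function that makes a single pass over the List and returns both List.distinct and a List of each element that appeared more than once, together with the index at which each duplicate appeared.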

Note: this answer is a straight copy of the answer on a related question.

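Details: If you need a bit more information about the duplicates themselves, like I did, I have written a more general function which iterates across a List (as ordering was significant) exactly once and returns a Tuple2. The first element is the original List deduped (all duplicates after the first occurrence are removed, i.e. the same as invoking distinct). The second element is a List showing each duplicate and the Int index at which it occurred within the original List.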

Here is the function:

def filterDupes[A](items: List[A]): (List[A], List[(A, Int)]) = {
  def recursive(remaining: List[A], index: Int, accumulator: (List[A], List[(A, Int)])): (List[A], List[(A, Int)]) =
    if (remaining.isEmpty)
      accumulator
    else
      recursive(
          remaining.tail
        , index + 1
        , if (accumulator._1.contains(remaining.head))
            (accumulator._1, (remaining.head, index) :: accumulator._2)
          else
            (remaining.head :: accumulator._1, accumulator._2)
      )
  val (distinct, dupes) = recursive(items, 0, (Nil, Nil))
  (distinct.reverse, dupes.reverse)
}

Here is an example which might make it a bit more intuitive. Given this List of String values:

val withDupes =
  List("a.b", "a.c", "b.a", "b.b", "a.c", "c.a", "a.c", "d.b", "a.b")

...and then performing the following:

val (deduped, dupeAndIndexs) =
  filterDupes(withDupes)

...the results are:

deduped: List[String] = List(a.b, a.c, b.a, b.b, c.a, d.b)
dupeAndIndexs: List[(String, Int)] = List((a.c,4), (a.c,6), (a.b,8))

If you just want the duplicates, simply map across dupeAndIndexs and invoke distinct:

val dupesOnly =
  dupeAndIndexs.map(_._1).distinct

...or all in a single call:

val dupesOnly =
  filterDupes(withDupes)._2.map(_._1).distinct

...or if a Set is preferred, skip distinct and invoke toSet...

val dupesOnly2 =
  dupeAndIndexs.map(_._1).toSet

...or all in a single call:

val dupesOnly2 =
  filterDupes(withDupes)._2.map(_._1).toSet

This is a straight copy of the filterDupes function from my open source Scala library, ScalaOlio. It is located at org.scalaolio.collection.immutable.List_._.
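One thing to keep in mind: filterDupes as written checks accumulator._1.contains, which is a linear scan of a List, so the cost can grow quadratically with the number of distinct elements. A minimal sketch of a variant that tracks the seen elements in a Set instead is shown here. The names FilterDupesSketch and filterDupesWithSet are hypothetical and are not part of ScalaOlio.

```scala
object FilterDupesSketch {
  // Hypothetical variant of filterDupes: same result shape, but membership
  // is checked against a Set rather than against the accumulated List.
  def filterDupesWithSet[A](items: List[A]): (List[A], List[(A, Int)]) = {
    val (_, distinct, dupes) =
      items.zipWithIndex.foldLeft((Set.empty[A], List.empty[A], List.empty[(A, Int)])) {
        case ((seen, ds, dups), (item, index)) =>
          if (seen.contains(item)) (seen, ds, (item, index) :: dups) // repeat: record its index
          else (seen + item, item :: ds, dups)                       // first occurrence: keep it
      }
    // Both lists were built by prepending, so restore the original order.
    (distinct.reverse, dupes.reverse)
  }
}
```

For the withDupes example above, this should produce the same deduped and dupeAndIndexs values as filterDupes.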

Answer 6

If you are trying to check for duplicates in a test, then ScalaTest can be helpful.

import org.scalatest.Inspectors._
import org.scalatest.Matchers._
forEvery(list.distinct) { item =>
  withClue(s"value $item, the number of occurences") {
    list.count(_ == item) shouldBe 1
  }
}

// example:
scala> val list = List(1,2,3,4,3,2)
list: List[Int] = List(1, 2, 3, 4, 3, 2)

scala> forEvery(list.distinct) { item => withClue(s"value $item, the number of occurences") { list.count(_ == item) shouldBe 1 } }
org.scalatest.exceptions.TestFailedException: forEvery failed, because:
  at index 1, value 2, the number of occurences 2 was not equal to 1 (<console>:19),
  at index 2, value 3, the number of occurences 2 was not equal to 1 (<console>:19)
in List(1, 2, 3, 4)

Simplest way to determine whether a List contains duplicates?
