Question

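I need a very efficient way to find duplicates in an unsorted sequence. This is what I came up with, but it has a few shortcomings, namely it

1. unnecessarily counts occurrences beyond 2
2. consumes the entire sequence before yielding duplicates
3. creates several intermediate sequences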

module Seq = 
  let duplicates items =
    items
    |> Seq.countBy id
    |> Seq.filter (snd >> ((<) 1))
    |> Seq.map fst

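Whatever its shortcomings, I see no reason to replace this with twice as much code. Is it possible to improve on it with comparably concise code?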

Answer 1

A more elegant functional solution:

let duplicates xs =
  Seq.scan (fun xs x -> Set.add x xs) Set.empty xs
  |> Seq.zip xs
  |> Seq.choose (fun (x, xs) -> if Set.contains x xs then Some x else None)

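This uses scan to accumulate the set of all elements seen so far. It then uses zip to pair each element with the set of elements that came before it. Finally, it uses choose to keep only the elements that are already in the set of previously seen elements, i.e. the duplicates.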
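To make the intermediate values concrete, here is a minimal sketch that runs the function above on a small input. The sample list and the names sample, prefixes and result are illustrative assumptions, not part of the original answer.

```fsharp
// Same definition as above, repeated so the snippet is self-contained
// (the accumulator is renamed to `seen` for readability).
let duplicates xs =
  Seq.scan (fun seen x -> Set.add x seen) Set.empty xs
  |> Seq.zip xs
  |> Seq.choose (fun (x, seen) -> if Set.contains x seen then Some x else None)

// Hypothetical sample input.
let sample = [1; 2; 1; 3; 2; 1]

// Seq.scan yields the initial state first, so the set at position i
// contains exactly the elements that occur before position i.
let prefixes =
  Seq.scan (fun seen x -> Set.add x seen) Set.empty sample |> List.ofSeq

let result = duplicates sample |> List.ofSeq
printfn "%A" result
```

Note that an element occurring three times is emitted twice here, because every occurrence after the first is found in the set of earlier elements.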

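Edit

Actually, my original answer was completely wrong. Firstly, you don't want duplicates in your output. Secondly, you want performance.

Here is a purely functional solution that implements the algorithm you are after: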

let duplicates xs =
  (Map.empty, xs)
  ||> Seq.scan (fun xs x ->
      match Map.tryFind x xs with
      | None -> Map.add x false xs
      | Some false -> Map.add x true xs
      | Some true -> xs)
  |> Seq.zip xs
  |> Seq.choose (fun (x, xs) ->
      match Map.tryFind x xs with
      | Some false -> Some x
      | None | Some true -> None)

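This uses a map to track whether each element has been seen once or more than once so far, and emits an element when it is encountered after having been seen exactly once before, i.e. the first time it is duplicated.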
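As a rough check of that description, the sketch below runs the map-based version on a small input. The input list and the name result are assumptions made for illustration.

```fsharp
// Map-based version, repeated so the snippet is self-contained.
// The map value is false after one sighting and true after two or more.
let duplicates xs =
  (Map.empty, xs)
  ||> Seq.scan (fun seen x ->
      match Map.tryFind x seen with
      | None -> Map.add x false seen
      | Some false -> Map.add x true seen
      | Some true -> seen)
  |> Seq.zip xs
  |> Seq.choose (fun (x, seen) ->
      match Map.tryFind x seen with
      | Some false -> Some x          // seen exactly once before: first duplication
      | None | Some true -> None)     // first sighting, or already reported

// Hypothetical sample input: 1 occurs three times, 2 twice, 3 once.
let result = duplicates [1; 2; 1; 3; 2; 1] |> List.ofSeq
printfn "%A" result
```

Each duplicated value should appear once in the output, in the order in which its second occurrence is reached.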

Here is a faster imperative version:

let duplicates (xs: _ seq) =
  seq { let d = System.Collections.Generic.Dictionary(HashIdentity.Structural)
        let e = xs.GetEnumerator()
        while e.MoveNext() do
          let x = e.Current
          let mutable seen = false
          if d.TryGetValue(x, &seen) then
            if not seen then
              d.[x] <- true
              yield x
          else
            d.[x] <- false }

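This is around 2× faster than any of your other answers (at the time of writing).

Enumerating the elements of a sequence with a for x in xs do loop is substantially slower than using GetEnumerator directly, but generating your own Enumerator is not significantly faster than using a computation expression with yield.

Note that the TryGetValue member of Dictionary lets me avoid allocation in the inner loop by mutating a stack-allocated value, whereas the TryGetValue extension member offered by F# (and used by kvb in his/her answer) allocates its return tuple.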
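The two call forms being contrasted can be sketched side by side. The helper names lookupByRef and lookupTuple and the sample dictionary are hypothetical; whether the tuple form actually allocates may depend on the compiler version, so the performance claim is best read as the answer's own measurement.

```fsharp
open System.Collections.Generic

// Form 1: pass the address of a mutable local as the out argument.
// The result is written into `value`; no tuple is involved.
let lookupByRef (d: Dictionary<string, int>) (key: string) =
    let mutable value = 0
    if d.TryGetValue(key, &value) then Some value else None

// Form 2: omit the out argument and let F# return it as part of a
// (bool * value) tuple, which is then pattern matched.
let lookupTuple (d: Dictionary<string, int>) (key: string) =
    match d.TryGetValue key with
    | true, value -> Some value
    | false, _ -> None

// Hypothetical sample data.
let sampleDict = Dictionary<string, int>()
sampleDict.["a"] <- 1
```

Both helpers return the same results; they differ only in how the out parameter is received.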

Answer 2

Here's an imperative solution (admittedly slightly longer):

let duplicates items =
    seq {
        let d = System.Collections.Generic.Dictionary()
        for i in items do
            match d.TryGetValue(i) with
            | false,_    -> d.[i] <- false         // first observance
            | true,false -> d.[i] <- true; yield i // second observance
            | true,true  -> ()                     // already seen at least twice
    }

Answer 3

This is the best "functional" solution I could come up with that doesn't consume the entire sequence up front.

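// Written with an explicit `items` parameter: a point-free definition
// (Seq.scan ... >> Seq.choose ...) can trigger a value restriction error
// unless it is used at a single concrete type.
let duplicates items =
    items
    |> Seq.scan (fun (out, yielded:Set<_>, seen:Set<_>) item ->
        if yielded.Contains item then
            (None, yielded, seen)
        else
            if seen.Contains item then
                (Some(item), yielded.Add item, seen.Remove item)
            else
                (None, yielded, seen.Add item)
    ) (None, Set.empty, Set.empty)
    |> Seq.choose (fun (x,_,_) -> x)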
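One way to see that this version does not consume the whole input up front is to feed it an infinite sequence and take only a few results. The sketch below is self-contained and restates the function with an explicit parameter; the names cycling and firstThree are assumptions for illustration.

```fsharp
// Scan-based version, restated with an explicit parameter so that it
// stays generic and the snippet is self-contained.
let duplicates items =
    items
    |> Seq.scan (fun (_, yielded: Set<_>, seen: Set<_>) item ->
        if yielded.Contains item then (None, yielded, seen)
        elif seen.Contains item then (Some item, yielded.Add item, seen.Remove item)
        else (None, yielded, seen.Add item)
    ) (None, Set.empty, Set.empty)
    |> Seq.choose (fun (x, _, _) -> x)

// Hypothetical infinite input: 0, 1, 2, 0, 1, 2, ...
let cycling = Seq.initInfinite (fun i -> i % 3)

// Only as much of the input is read as is needed to find three duplicates.
let firstThree = cycling |> duplicates |> Seq.take 3 |> List.ofSeq
printfn "%A" firstThree
```

Asking for a fourth result from this particular input would never return, since only three distinct values ever occur.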

Answer 4

Assuming your sequence is finite, this solution needs only one pass over the sequence:

open System.Collections.Generic
let duplicates items =
   let dict = Dictionary()
   items |> Seq.fold (fun acc item -> 
                             match dict.TryGetValue item with
                             | true, 2 -> acc
                             | true, 1 -> dict.[item] <- 2; item::acc
                             | _ -> dict.[item] <- 1; acc) []
         |> List.rev

You can pass the length of the sequence as the capacity of the Dictionary, but that requires enumerating the whole sequence one more time.
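A minimal sketch of that capacity variant might look as follows, assuming the input can safely be enumerated twice. The name duplicatesPresized is hypothetical.

```fsharp
open System.Collections.Generic

// Variant of the fold-based solution that pre-sizes the dictionary.
// Seq.length walks the entire sequence once just to obtain the count,
// so the input is enumerated twice overall.
let duplicatesPresized items =
    let dict = Dictionary<_, int>(Seq.length items)
    items
    |> Seq.fold (fun acc item ->
        match dict.TryGetValue item with
        | true, 2 -> acc                              // already reported
        | true, 1 -> dict.[item] <- 2; item :: acc    // second occurrence
        | _ -> dict.[item] <- 1; acc) []              // first occurrence
    |> List.rev

let result = duplicatesPresized [1; 2; 1; 3; 2; 1]
printfn "%A" result
```

The capacity is an upper bound here, since the dictionary only ever holds the distinct elements.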

Edit: To address the second problem, you can generate the duplicates on demand:

open System.Collections.Generic
let duplicates items =
   seq {
      let dict = Dictionary()
      for item in items do
         match dict.TryGetValue item with
         | true, 2 -> ()
         | true, 1 -> dict.[item] <- 2; yield item
         | _ -> dict.[item] <- 1
   }
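To illustrate what "on demand" means here, the following sketch counts how many input elements are pulled before the first duplicate is returned. The counter pulled, the instrumented sequence source and the sample data are assumptions for illustration.

```fsharp
open System.Collections.Generic

// Sequence-expression version, repeated so the snippet is self-contained.
let duplicates items =
    seq {
        let dict = Dictionary()
        for item in items do
            match dict.TryGetValue item with
            | true, 2 -> ()
            | true, 1 ->
                dict.[item] <- 2
                yield item
            | _ -> dict.[item] <- 1
    }

// Hypothetical instrumented input that counts how many elements are read.
let pulled = ref 0
let source =
    seq {
        for x in [1; 1; 2; 3; 2] do
            pulled.Value <- pulled.Value + 1
            yield x
    }

// Seq.head stops as soon as the first duplicate has been produced.
let first = source |> duplicates |> Seq.head
printfn "first = %d, elements read = %d" first pulled.Value
```

Only the first two input elements need to be read to produce the first duplicate; the remaining three are never touched.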

Answer 5

A functional solution:

let duplicates items = 
  let test (unique, result) v =
    if not(unique |> Set.contains v) then (unique |> Set.add v ,result) 
    elif not(result |> Set.contains v) then (unique,result |> Set.add v) 
    else (unique, result)
  items |> Seq.fold test (Set.empty, Set.empty) |> snd |> Set.toSeq
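One consequence of collecting the results in a Set is that they come back in sorted order rather than in the order in which the duplicates were encountered, and nothing is produced until the fold has consumed the whole input. A small sketch with an assumed sample list:

```fsharp
// Same definition as above, repeated so the snippet is self-contained.
let duplicates items =
  let test (unique, result) v =
    if not (unique |> Set.contains v) then (unique |> Set.add v, result)
    elif not (result |> Set.contains v) then (unique, result |> Set.add v)
    else (unique, result)
  items |> Seq.fold test (Set.empty, Set.empty) |> snd |> Set.toSeq

// Hypothetical input in which 3 is duplicated before 1.
let result = duplicates [3; 1; 3; 1; 2] |> List.ofSeq
printfn "%A" result
```

Here 3 is the first value to be seen twice, yet 1 comes first in the output because Set.toSeq enumerates in ascending order.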

Finding duplicates in an unsorted sequence efficiently
