Question

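What is the fastest way to find duplicates in a TStringList? I receive the data that I need to search for duplicates in a string list. My current idea looks like this: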

    var
      TestStringList, DataStringList: TStringList;
      i: Integer;

    for i := 0 to DataStringList.Count - 1 do
    begin
      if TestStringList.IndexOf(DataStringList[i]) < 0 then
      begin
        TestStringList.Add(DataStringList[i]);
      end
      else
      begin
        Memo1.Lines.Add('duplicate item found');
      end;
    end;
   ....

Answer 1

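For completeness (and because your code does not actually use the duplicate, it only reports that one was found): Delphi's TStringList has built-in support for handling duplicate entries, through its Duplicates property. Setting it to dupIgnore simply discards any duplicate you try to add. Note that the destination list must be sorted, otherwise Duplicates has no effect.

    TestStringList.Sorted := True;
    TestStringList.Duplicates := dupIgnore;

    for i := 0 to DataStringList.Count - 1 do
      TestStringList.Add(DataStringList[i]);
    Memo1.Lines.Add(Format('%d duplicates discarded',
      [DataStringList.Count - TestStringList.Count]));

A quick test shows that, once Sorted and Duplicates are set this way, the entire loop can be removed, for instance by adding the whole source list in a single call such as TestStringList.AddStrings(DataStringList).

See the TStringList.Duplicates documentation for details.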
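
As a minimal sketch of the loop-free variant, assuming the loop is replaced by a single AddStrings call: the console program below fills a list with hypothetical sample data and lets a sorted list with dupIgnore drop the repeats. The program name and the sample strings are made up for illustration.

```pascal
program DiscardDuplicates;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  DataStringList, TestStringList: TStringList;
begin
  DataStringList := TStringList.Create;
  TestStringList := TStringList.Create;
  try
    // Hypothetical sample data: 'apple' and 'pear' each appear twice.
    DataStringList.Add('apple');
    DataStringList.Add('pear');
    DataStringList.Add('apple');
    DataStringList.Add('plum');
    DataStringList.Add('pear');

    // Sorted must be True, otherwise Duplicates has no effect.
    TestStringList.Sorted := True;
    TestStringList.Duplicates := dupIgnore;

    // AddStrings adds every item in turn; repeats are silently dropped.
    TestStringList.AddStrings(DataStringList);

    // With the sample data above: 5 items in, 3 kept.
    Writeln(Format('%d duplicates discarded',
      [DataStringList.Count - TestStringList.Count]));
  finally
    TestStringList.Free;
    DataStringList.Free;
  end;
end.
```

Keep in mind that a TStringList compares case-insensitively unless CaseSensitive is set to True, so 'Apple' and 'apple' would count as the same entry here.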

Answer 2

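I think that you are looking for duplicates. If so, you can do the following:

Case 1: The string list is ordered

In this scenario, duplicates must appear at adjacent indices. You simply loop from 1 to Count-1 and check whether the element at index i is the same as the element at index i-1.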
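
The adjacent-index check for an ordered list could be sketched as follows. HasAdjacentDuplicates is a hypothetical helper name, and the sketch assumes the list is already ordered in a way that places equal strings next to each other.

```pascal
program AdjacentDuplicates;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Returns True as soon as two neighbouring items are equal.
// Assumption: List is already ordered, so equal strings are adjacent.
function HasAdjacentDuplicates(List: TStrings): Boolean;
var
  I: Integer;
begin
  Result := False;
  for I := 1 to List.Count - 1 do
    if List[I] = List[I - 1] then
    begin
      Result := True;
      Exit;
    end;
end;

var
  List: TStringList;
begin
  List := TStringList.Create;
  try
    // Hypothetical sample data, added in sorted order.
    List.Add('apple');
    List.Add('pear');
    List.Add('pear');
    List.Add('plum');

    if HasAdjacentDuplicates(List) then
      Writeln('duplicate item found')
    else
      Writeln('no duplicates');
  finally
    List.Free;
  end;
end.
```

The helper takes a TStrings so it also works on Memo1.Lines or any other TStrings descendant, provided the content is ordered.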

Case 2: The string list is not ordered

In this case we need a double loop. It looks like this:

for i := 0 to List.Count-1 do
  for j := i+1 to List.Count-1 do
    if List[i]=List[j] then
      // duplicate found

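There are performance considerations. If the list is ordered, the search is O(N). If the list is not ordered, the search is O(N²). Clearly the former is preferable. Since a list can be sorted with complexity O(N log N), it will be advantageous to sort the list before searching for duplicates if performance becomes a factor.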

Answer 3

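Judging by the use of IndexOf, you are working with an unsorted list. The scaling factor of your algorithm is then n^2. That is slow. You can optimize it, as David shows, by limiting the search area of the inner search; the average factor then becomes n^2/2, but that still scales badly.

Note: a scaling factor like this is meaningful for limited workloads, say a dozen or a few hundred strings per list. For larger data sets, an asymptotic O(...) measure is more suitable. Finding the O-measures for QuickSort and for hash lists is, however, a trivial task.

Option 1: Sort the list. Using quicksort, this has a scaling factor of n + n*log(n), or O(n*log(n)) for large loads.

Set Duplicates to dupAccept
Set Sorted to True
Iterate over the sorted list and check whether the next string exists and is the same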
http://docwiki.embarcadero.com/Libraries/XE3/en/System.Classes.TStringList.Duplicates
http://docwiki.embarcadero.com/Libraries/XE3/en/System.Classes.TStringList.Sorted
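
The three steps of Option 1 might be sketched like this. ReportDuplicates is a hypothetical name, the sample data is made up, and the sketch assumes that a case-sensitive TStringList orders its items consistently with AnsiSameStr, so that equal strings end up adjacent.

```pascal
program SortedScan;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Copies Source into a sorted helper list and reports equal neighbours.
procedure ReportDuplicates(Source: TStrings);
var
  Work: TStringList;
  I: Integer;
begin
  Work := TStringList.Create;
  try
    Work.CaseSensitive := True;    // assumption: 'A' and 'a' are different items
    Work.Duplicates := dupAccept;  // keep every copy so repeats stay visible
    Work.Sorted := True;           // each added string is inserted in order
    Work.AddStrings(Source);

    // In a sorted list, equal strings are neighbours.
    // A string that occurs three times is reported twice.
    for I := 0 to Work.Count - 2 do
      if AnsiSameStr(Work[I], Work[I + 1]) then
        Writeln('duplicate item found: ', Work[I]);
  finally
    Work.Free;
  end;
end;

var
  Data: TStringList;
begin
  Data := TStringList.Create;
  try
    // Hypothetical sample data.
    Data.Add('apple');
    Data.Add('pear');
    Data.Add('apple');
    Data.Add('plum');
    Data.Add('pear');
    ReportDuplicates(Data);
  finally
    Data.Free;
  end;
end.
```

Working on a copy leaves the order of the original list untouched, at the cost of one extra list in memory.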

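Option 2: Use a hashed-list helper. In modern Delphi that would be TDictionary<String,Boolean>; older Delphi versions have a class that is used by TMemIniFile.

You iterate over your string list and check whether each string has already been added to the helper collection.

The scaling factor of each lookup is a constant for small data chunks and O(1) for large ones - see http://docwiki.embarcadero.com/Libraries/XE2/en/System.Generics.Collections.TDictionary.ContainsKey

If it was not, you add it with a "false" value.
If it was, you switch the value to "true".

For older Delphi versions, you can use THashedStringList in a similar pattern (thanks @FreeConsulting).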

http://docs.embarcadero.com/products/rad_studio/delphiAndcpp2009/HelpUpdate2/EN/html/delphivclwin32/IniFiles_THashedStringList_IndexOf.html
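
A minimal sketch of the dictionary pattern described in Option 2, assuming a Delphi version with Generics.Collections (or a recent Free Pascal). ReportDuplicates and the sample data are hypothetical. Note that a TDictionary with string keys compares case-sensitively by default, unlike a default TStringList.

```pascal
program HashedDuplicates;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes, Generics.Collections;

// Marks every string that occurs more than once, then lists the marked ones.
procedure ReportDuplicates(Source: TStrings);
var
  Seen: TDictionary<string, Boolean>;
  I: Integer;
  S: string;
begin
  Seen := TDictionary<string, Boolean>.Create;
  try
    for I := 0 to Source.Count - 1 do
    begin
      S := Source[I];
      if Seen.ContainsKey(S) then
        Seen[S] := True       // already present: switch the flag to "true"
      else
        Seen.Add(S, False);   // first occurrence: add with a "false" value
    end;

    // The enumeration order of dictionary keys is not defined.
    for S in Seen.Keys do
      if Seen[S] then
        Writeln('duplicate item found: ', S);
  finally
    Seen.Free;
  end;
end;

var
  Data: TStringList;
begin
  Data := TStringList.Create;
  try
    // Hypothetical sample data.
    Data.Add('apple');
    Data.Add('pear');
    Data.Add('apple');
    Data.Add('plum');
    Data.Add('pear');
    ReportDuplicates(Data);
  finally
    Data.Free;
  end;
end.
```

Because the flag is a Boolean, each duplicated string is reported once, no matter how many times it occurs.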

Answer 4

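Unfortunately, it is far from clear what you want to do with the duplicates. Your else clause suggests that you just want to know whether the list contains one (or more) duplicates. Although that could be the ultimate goal, I assume you want more.

Extracting duplicates

The previously given answers delete or count the duplicate items. Here is an answer that keeps them.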

procedure ExtractDuplicates1(List1, List2: TStringList; Dupes: TStrings);
var
  Both: TStringList;
  I: Integer;
begin
  Both := TStringList.Create;
  try
    Both.Sorted := True;
    Both.Duplicates := dupAccept;
    Both.AddStrings(List1);
    Both.AddStrings(List2);
    for I := 0 to Both.Count - 2 do
      if (Both[I] = Both[I + 1]) then
        if (Dupes.Count = 0) or (Dupes[Dupes.Count - 1] <> Both[I]) then
          Dupes.Add(Both[I]);
  finally
    Both.Free;
  end;
end;

Performance

Try the following alternatives to compare their performance with the routine above.

procedure ExtractDuplicates2(List1, List2: TStringList; Dupes: TStrings);
var
  Both: TStringList;
  I: Integer;
begin
  Both := TStringList.Create;
  try
    Both.AddStrings(List1);
    Both.AddStrings(List2);
    Both.Sort;
    for I := 0 to Both.Count - 2 do
      if (Both[I] = Both[I + 1]) then
        if (Dupes.Count = 0) or (Dupes[Dupes.Count - 1] <> Both[I]) then
          Dupes.Add(Both[I]);
  finally
    Both.Free;
  end;
end;

procedure ExtractDuplicates3(List1, List2, Dupes: TStringList);
var
  I: Integer;
begin
  Dupes.Sorted := True;
  Dupes.Duplicates := dupAccept;
  Dupes.AddStrings(List1);
  Dupes.AddStrings(List2);
  for I := Dupes.Count - 1 downto 1 do
    if (Dupes[I] <> Dupes[I - 1]) or (I > 1) and (Dupes[I] = Dupes[I - 2]) then
      Dupes.Delete(I);
  if (Dupes.Count > 1) and (Dupes[0] <> Dupes[1]) then
    Dupes.Delete(0);
  while (Dupes.Count > 1) and (Dupes[0] = Dupes[1]) do
    Dupes.Delete(0);
end;

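Although ExtractDuplicates3 performs marginally better, I prefer ExtractDuplicates1 because it is easier to follow and its TStrings parameter makes it more widely usable. ExtractDuplicates2 performs noticeably worst, which shows that sorting all items afterwards in a single run takes more time than continuously sorting every item as it is added.

Note

This answer is part of this recent answer, for which I was about to ask the same question: "how to keep duplicates?". I did not, but if anyone knows or finds a better solution, please comment, add to or update this answer.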

Answer 5

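This is an old thread, but I thought this solution might be useful.

One option is to pump the values from one string list into another with TestStringList.Duplicates := dupError; set, and then catch the exception.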

var
  TestStringList, DataStringList: TStringList;
  i: Integer;

TestStringList.Sorted := True;
TestStringList.Duplicates := dupError;

for i := 0 to DataStringList.Count - 1 do
begin
  try
    TestStringList.Add(DataStringList[i]);
  except
    // Add raises EStringListError when the item is already in the sorted list
    on E: EStringListError do
      Memo1.Lines.Add('duplicate item found');
  end;
end;

...

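Note that trapping the exception also masks the following errors: there is not enough memory to expand the list, the list tried to grow beyond its maximum capacity, or a non-existent element of the list was referenced (i.e. the list index was out of bounds).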

Answer 6

function TestDuplicates(const DataStrList: TStringList): Integer;
var
  Seen: TStringList;
  It: Integer;
begin
  Result := 0;  // number of items that already occurred earlier in the list
  Seen := TStringList.Create;
  try
    {Seen.Duplicates := dupIgnore;}
    for It := 0 to DataStrList.Count - 1 do
      if Seen.IndexOf(DataStrList[It]) < 0 then
        Seen.Add(DataStrList[It])
      else
        Inc(Result);
  finally
    Seen.Free;
  end;
end;

Quickly finding duplicates in a string list
