Question

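In Python, set is very convenient for comparing 2 lists of strings (see this link). I'd like to know whether there is a good C++ solution in terms of performance, because each list contains more than 1 million strings.

This is a case-sensitive match.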

Answer 1

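The data types std::set<> (usually implemented as a balanced tree) and std::unordered_set<> (available since C++11, implemented as a hash table) are available. There is also a convenient algorithm called std::set_intersection that computes the actual intersection.

Here is an example.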

#include <iostream>
#include <vector>
#include <string>
#include <set>             // for std::set
#include <algorithm>       // for std::set_intersection
#include <iterator>        // for std::back_inserter

int main()
{
  std::set<std::string> s1 { "red", "green", "blue" };
  std::set<std::string> s2 { "black", "blue", "white", "green" };

  /* Collecting the results in a vector. The vector may grow quite
     large -- it may be more efficient to print the elements directly. */     
  std::vector<std::string> s_both {};

  std::set_intersection(s1.begin(),s1.end(),
                        s2.begin(),s2.end(),
                        std::back_inserter(s_both));

  /* Printing the elements collected by the vector, just to show that
     the result is correct. */
  for (const std::string &s : s_both)
    std::cout << s << ' ';
  std::cout << std::endl;

  return 0;
}

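Note: if you want to use std::unordered_set<>, std::set_intersection cannot be used like this, because it requires its input ranges to be sorted. Instead, you have to use the usual for-loop technique: iterate over the smaller set and look up each element in the larger set to determine the intersection. However, for a large number of elements (especially strings), the hash-based std::unordered_set<> may be faster. There are also STL-compatible implementations, such as the one in Boost (boost::unordered_set) and those created by Google (sparse_hash_set and dense_hash_set). For various other implementations and benchmarks (including one for strings), see here.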
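The loop described above could be sketched roughly as follows. This is only a minimal sketch: the helper name hash_intersection is made up for illustration and is not part of the standard library or of the original answer.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (name chosen for illustration): intersects two hash
// sets by scanning the smaller one and probing the larger one.
std::vector<std::string> hash_intersection(
    const std::unordered_set<std::string>& a,
    const std::unordered_set<std::string>& b)
{
  // Iterate over the smaller set so that fewer lookups are needed.
  const auto& smaller = (a.size() <= b.size()) ? a : b;
  const auto& larger  = (a.size() <= b.size()) ? b : a;

  std::vector<std::string> both;
  for (const std::string& s : smaller)
    if (larger.count(s) > 0)   // average constant-time lookup
      both.push_back(s);
  return both;                 // element order is unspecified
}
```

Unlike the std::set version, the result is not sorted, so sort it afterwards if you need a stable order.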

Answer 2

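If you don't need much performance, I suggest using the STL map/set:

// assumes #include <list>, <set>, <string> and using namespace std;
list<string> list1, list2;   // renamed from "list" so the type name is not hidden
...
set<string> sndList;
list<string> result;

// copy the second list into a set for O(log n) lookups
for(list<string>::iterator it = list2.begin(); it != list2.end(); ++it)
   sndList.insert(*it);

// keep every element of the first list that also occurs in the second
for(list<string>::iterator it = list1.begin(); it != list1.end(); ++it)
    if(sndList.count(*it) > 0)
        result.push_back(*it);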

Otherwise I suggest using some hash function for the comparison.
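One way to read the hashing suggestion is to swap the tree-based set for std::unordered_set (C++11). The following is a sketch under that assumption; the function name common_strings is hypothetical and not something the answer itself provides.

```cpp
#include <list>
#include <string>
#include <unordered_set>

// Hypothetical helper: returns the elements of `first` that also occur in
// `second`, in the order in which they appear in `first`.
std::list<std::string> common_strings(const std::list<std::string>& first,
                                      const std::list<std::string>& second)
{
  // Build a hash-based lookup table from the second list.
  std::unordered_set<std::string> lookup;
  lookup.reserve(second.size());           // avoid repeated rehashing
  lookup.insert(second.begin(), second.end());

  std::list<std::string> result;
  for (const std::string& s : first)
    if (lookup.count(s) > 0)               // case-sensitive comparison
      result.push_back(s);
  return result;
}
```

Calling reserve up front is optional, but with around a million strings it probably saves several rehashes while the table is being filled.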

Answer 3

If it really is a std::list, sort them and use set_intersection:

list<string> words1;
list<string> words2;
list<string> common_words;

words1.sort();
words2.sort();

set_intersection(words1.begin(), words1.end(),
                 words2.begin(), words2.end(),
                 back_inserter(common_words));

C++: comparing 2 lists of strings