Question

I am reading a very large file of integers from an SSD (solid-state drive).

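#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <set>
#include <string>
using namespace std;

int main () {
  string line;
  srand (time(NULL));

  set<int> vec;
  for(unsigned long int j=0; j<1342177280; ++j){
     int i = rand() % 10 + 1; //although my code performs something complex, for simplicity I am taking random numbers.
     vec.insert(i);
  }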

  ifstream myfile ("example.txt");
  if (myfile.is_open())
  {
    int sum=0;
    while ( getline (myfile,line) )
    {
      int i=atoi(line.c_str());
      if(vec.count(i))
            sum+=i;
    }
    myfile.close();
  }

  else cout << "Unable to open file"; 

  return 0;
}

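Because vec is a very large set, vec.count(i) takes up most of my running time. Is there any way to reduce the lookup time (vec.count(i)) in my code? If so, could someone guide me on how to achieve it?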

I am using gcc version 4.8.

Answer 1

Here is an example using std::vector. I wrote a special insert() function that keeps the std::vector unique and sorted:

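#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
using namespace std;

// unique, sorted inserts
void insert(vector<int>& v, int i)
{
    // find insert position in sorted order
    auto found = std::lower_bound(v.begin(), v.end(), i);

    // avoid duplicates
    if(found == v.end() || *found != i)
        v.insert(found, i);
}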

int main () {
  string line;
  srand (time(NULL));

  vector<int> vec;
  for(unsigned long int j=0; j<1342177280; ++j){
     int i = rand() % 10 + j; //although my code performs something complex, for simplicity I am taking random numbers.
     insert(vec, i); // use our special insert() function
  }

  ifstream myfile ("example.txt");
  if (myfile.is_open())
  {
    long long sum=0; // the sums reported below are far too large for an int
    while ( getline (myfile,line) )
    {
      int i=atoi(line.c_str());
      // binary search has O(log n) complexity
      if(std::binary_search(vec.begin(), vec.end(), i))
            sum+=i;
    }
    myfile.close();
  }

  else cout << "Unable to open file";

  return 0;
}

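The reason std::vector may give you better performance than std::set is that vectors work well with the CPU cache, because their elements are stored in contiguous memory. So even if the vector is larger than the CPU cache, it still benefits, because the cache loads the vector in blocks.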
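If all the values are known before the lookups start, one possible variant of the sorted-vector idea is to append everything first and then sort and deduplicate once, rather than keeping the vector sorted on every insert. The sketch below assumes that situation; build_sorted_unique and contains are hypothetical names and are not part of the original answer.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: sort once and drop duplicates, instead of
// keeping the vector sorted and unique on every single insert.
std::vector<int> build_sorted_unique(std::vector<int> values)
{
    std::sort(values.begin(), values.end());
    // std::unique moves duplicates to the tail; erase removes that tail
    values.erase(std::unique(values.begin(), values.end()), values.end());
    // debug-only postcondition: binary search requires sorted input
    assert(std::is_sorted(values.begin(), values.end()));
    return values;
}

// Hypothetical helper: O(log n) membership test on a sorted vector.
bool contains(const std::vector<int>& sorted, int x)
{
    return std::binary_search(sorted.begin(), sorted.end(), x);
}
```

The lookup loop stays the same as above: read each number from the file and add it to the sum when contains(vec, i) is true.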

Edit

In my tests (using only 100000000 numbers), the vector consistently outperformed the set:

   sum              time
v: 2121140014233422 11.1849 secs
s: 2121140014233422 15.2953 secs
v: 2121140014233422 11.2197 secs
s: 2121140014233422 15.0505 secs
v: 2121140014233422 11.1063 secs
s: 2121140014233422 14.9652 secs
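The test program behind these numbers is not shown. A comparison of this kind could be set up roughly as follows; this is only a sketch, and Result, time_lookups and compare are hypothetical names, not the code that produced the timings above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <set>
#include <vector>

// Hypothetical result record: sum of the matched queries and elapsed time.
struct Result {
    std::int64_t sum;
    double seconds;
};

// Calls found(q) for every query and times the whole loop.
template <typename Lookup>
Result time_lookups(const std::vector<int>& queries, Lookup found)
{
    auto start = std::chrono::steady_clock::now();
    std::int64_t sum = 0;
    for (int q : queries)
        if (found(q))
            sum += q;
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return Result{sum, elapsed.count()};
}

// Times the same queries against a sorted vector and a set, then
// prints one "v:" line and one "s:" line.
void compare(const std::vector<int>& sorted_vec, const std::set<int>& s,
             const std::vector<int>& queries)
{
    Result v = time_lookups(queries, [&](int x) {
        return std::binary_search(sorted_vec.begin(), sorted_vec.end(), x);
    });
    Result t = time_lookups(queries, [&](int x) {
        return s.find(x) != s.end();
    });
    // sanity check, assuming both containers hold the same values
    assert(v.sum == t.sum);
    std::cout << "v: " << v.sum << " " << v.seconds << " secs\n";
    std::cout << "s: " << t.sum << " " << t.seconds << " secs\n";
}
```

Timings from a harness like this depend heavily on the compiler flags, the data and the machine, so it is worth repeating the run several times, as was done above.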

Answer 2

If you want to keep using a set, use its find method. Its complexity is logarithmic.

if(vec.find(val) != vec.end())
{
    // val is in the set
}
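Put into the shape of the question's loop, the find-based test might look like the sketch below. sum_present is a hypothetical helper name, and the values are passed in as a vector here instead of being read from the file.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical helper: sums the values that are present in s,
// using find() for the membership test.
long long sum_present(const std::set<int>& s, const std::vector<int>& values)
{
    long long sum = 0;
    for (int val : values) {
        bool hit = s.find(val) != s.end();
        // debug-only check: a std::set holds unique keys, so count() is 0 or 1
        assert(s.count(val) == (hit ? 1u : 0u));
        if (hit)
            sum += val;
    }
    return sum;
}
```

The accumulator is a long long because a sum over a very large file can exceed the range of an int.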

How to reduce lookup time in C++
