Question

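I'm currently working on a school project: a spell checker written in C++. For the part that checks whether a word exists, I currently do the following: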

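1) I found a .txt file online that contains all English words.
2) My program first goes through this text file and places each of its entries in a map object, for easy access.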

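The problem with this approach is that step 2) takes about 20 seconds when the program starts. That is not a big deal in itself, but I was wondering whether any of you can think of another way to make my database available quickly. For instance, is there a way to store the map object in a file, so that I don't need to build it from the text file every time?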

Answer 1

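If the file with all English words is not dynamic, you can store it in a static map. To do so, you need to parse the .txt file, which looks something like: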

alpha

beta

gamma

...

and convert it to something like this:

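static std::map<std::string, int> word_dictionary = {
  { "alpha", 0 },
  { "beta", 0 },
  { "gamma", 0 },
  ...
};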

You can do this programmatically, or simply with find and replace in your favorite text editor.

Your .exe will be much heavier than before, but it will also start much faster than when it reads this information from a file.

Answer 2

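I'm a bit surprised that nobody has come up with the idea of serialization yet. Boost provides great support for such a solution. If I understood you correctly, the problem is that reading in the word list (and putting the words into a data structure that hopefully provides fast look-up) takes too long every time you start the application. Building such a structure once and then saving it to a binary file for later reuse should improve the performance of your application (based on the results given below).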

Here is a piece of code (and a minimal working example at the same time) that might help you out.

#include <chrono>
#include <fstream>
#include <iostream>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/set.hpp> 

#include "prettyprint.hpp"

class Dictionary {
public:
  Dictionary() = default;
  Dictionary(std::string const& file_)
    : _file(file_)
  {}

  inline size_t size() const { return _words.size(); }

  void build_wordset()
  {
    if (!_file.size()) { throw std::runtime_error("No file to read!"); }

    std::ifstream infile(_file);
    if (!infile) { throw std::runtime_error("Cannot open " + _file); }
    std::string line;

    while (std::getline(infile, line)) {
      _words.insert(line);
    }
  }

  friend std::ostream& operator<<(std::ostream& os, Dictionary const& d)
  {
    os << d._words;  // cxx-prettyprint used here
    return os;
  }

  int save(std::string const& out_file) 
  { 
    std::ofstream ofs(out_file.c_str(), std::ios::binary);
    if (ofs.fail()) { return -1; }

    boost::archive::binary_oarchive oa(ofs); 
    oa << _words;
    return 0;
  }

  int load(std::string const& in_file)
  {
    _words.clear();

    // Open in binary mode to match how save() wrote the archive.
    std::ifstream ifs(in_file, std::ios::binary);
    if (ifs.fail()) { return -1; }

    boost::archive::binary_iarchive ia(ifs);
    ia >> _words;
    return 0;
  }

private:
  friend class boost::serialization::access;

  template <typename Archive>
  void serialize(Archive& ar, const unsigned int version)
  {
    ar & _words;
  }

private:
  std::string           _file;
  std::set<std::string> _words;
};

void create_new_dict()
{
  std::string const in_file("words.txt");
  std::string const ser_dict("words.set");

  Dictionary d(in_file);

  // steady_clock is monotonic, which makes it the safer choice for timing.
  auto start = std::chrono::steady_clock::now();
  d.build_wordset();
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Building up the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;

  d.save(ser_dict);
}

void use_existing_dict()
{
  std::string const ser_dict("words.set");

  Dictionary d;

  // steady_clock is monotonic, which makes it the safer choice for timing.
  auto start = std::chrono::steady_clock::now();
  d.load(ser_dict);
  auto end = std::chrono::steady_clock::now();
  auto elapsed =
    std::chrono::duration_cast<std::chrono::milliseconds>(end - start);

  std::cout << "Loading in the dictionary took: " << elapsed.count()
            << " (ms)" << std::endl
            << "Size of the dictionary: " << d.size() << std::endl;
}

int main()
{
  create_new_dict();
  use_existing_dict();
  return 0;
}

Sorry for not putting the code into separate files and for the poor design; for demonstration purposes, however, it should suffice.

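Note that I didn't use a map: I just didn't see the point of storing a bunch of zeros or anything else unnecessary. AFAIK, a std::set is backed by the same powerful RB-tree as a std::map.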
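To illustrate why a set is enough for a membership test, here is a minimal sketch of a look-up. The free function `contains_word` is a hypothetical name and is not part of the Dictionary class above.

```cpp
#include <set>
#include <string>

// Returns true if `word` is stored in `words`.
// std::set::find performs a logarithmic number of comparisons.
bool contains_word(std::set<std::string> const& words, std::string const& word)
{
  return words.find(word) != words.end();
}
```

No mapped value is involved at any point, which is why storing only the keys is sufficient for a spell checker that just needs to know whether a word exists.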

For the dataset available here (it contains about 466k words), I got the following results:

Building up the dictionary took: 810 (ms)
Size of the dictionary: 466544
Loading in the dictionary took: 271 (ms)
Size of the dictionary: 466544

Dependencies:

Boost's Serialization component (I used version 1.58, though).
louisdx/cxx-prettyprint.

Hope this helps. :) Cheers.

Answer 3

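First things first. Don't use a map (or a set) to store your word list. Use a vector of strings, make sure its contents are sorted (I believe your word list is already sorted), and then use std::binary_search from the <algorithm> header to check whether a word is in the dictionary.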

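Although this may still be highly suboptimal (depending on whether your compiler does small string optimization), your load time will improve by at least an order of magnitude. Do a benchmark, and if you want to make it even faster, post another question about the vector of strings.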

How can I store a large word database easily and quickly?
