Question

I'm trying to find the best way to determine which lines of a large file contain a certain word.

For example, if you had the following file:

cat dog monkey 
banana chair elephant 
monkey phone platypus cat

I would want it to return 0,2 for "cat".

I'd like the function prototype to look something like this:

std::vector<int> FindWords(std::string word);

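I'd like to preprocess the file into some data structure so that lookups can be done quickly, returning the numbers of the lines that contain the word. I know std::map could do this if there were only one instance of each word, but a word can occur more than once.

What is the most appropriate algorithm?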

Answer 1

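Build a trie data structure for all the unique words in the file.

For each word in the trie, store the list of line numbers on which the word occurs in the file. This can be done in a single pass over the file.

You could also use a map to store the list of line numbers for each word, but a trie would be more compact.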
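To make the map alternative concrete, here is a minimal sketch, assuming words are separated by whitespace and lines are numbered from 0 as in the question. The class name LineIndex is made up for illustration, and std::unordered_map is my choice; std::map would work the same way.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: maps each word to the ascending list of 0-based
// line numbers on which it appears.
class LineIndex {
public:
    // Reads the stream once, splitting each line on whitespace.
    explicit LineIndex(std::istream& in) {
        std::string line;
        for (int lineNo = 0; std::getline(in, line); ++lineNo) {
            std::istringstream words(line);
            std::string word;
            while (words >> word) {
                std::vector<int>& lines = index_[word];
                // Record a line only once, even if the word repeats on it.
                if (lines.empty() || lines.back() != lineNo) {
                    lines.push_back(lineNo);
                }
            }
        }
    }

    // Returns the line numbers containing `word`, or an empty vector.
    std::vector<int> FindWords(const std::string& word) const {
        auto it = index_.find(word);
        if (it == index_.end()) {
            return {};
        }
        return it->second;
    }

private:
    std::unordered_map<std::string, std::vector<int>> index_;
};
```

Construct it from a std::ifstream (or any std::istream) and call FindWords("cat"). Storing a vector per word sidesteps the "only one instance" limitation mentioned in the question.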

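I've added a C declaration of the trie data structure below. It should give you an idea of how to get started if you want to implement it yourself.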

#include <stdbool.h>

#define MAX_NUM 1024                /* Assumed capacity; the original leaves MAX_NUM undefined */

/*
 * TRIE data structure defined for lower-case letters (a-z)
 */
typedef struct trie {
  char c;                           /* Letter represented by the trie node */
  struct trie *child[26];           /* Child pointers, one for each of the 26 letters of the alphabet */
  bool isTerminal;                  /* If any word ends at that node, TRUE, else FALSE */
  int counts;                       /* Number of lines the word ending at node occurs in the text */
  int lines[MAX_NUM];               /* Line numbers of the word occurrences in the text */
} trie;

/*
 * Insert a word into the trie.
 * word - Word which is being inserted
 * line - Line number of word in the text.
 */
void insertToTrie(trie *node, const char *word, int line);
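
Since the question asks for C++, here is a rough sketch of how the insert and the lookup might look there. It is not the C implementation declared above: the names TrieNode and InsertToTrie are made up, a std::vector replaces the fixed-size lines array, and a non-empty vector plays the role of isTerminal. It assumes C++14 (for std::make_unique), lower-case a-z words only, and lines inserted in non-decreasing order.

```cpp
#include <array>
#include <memory>
#include <string>
#include <vector>

// Illustrative C++ counterpart of the C declaration (hypothetical names).
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;  // one slot per letter a-z
    std::vector<int> lines;  // lines on which the word ending here occurs
};

// Inserts `word` and records `line`. Returns false, leaving the trie
// unchanged, if the word contains a character outside a-z.
bool InsertToTrie(TrieNode& root, const std::string& word, int line) {
    for (char c : word) {
        if (c < 'a' || c > 'z') return false;
    }
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& next = node->child[c - 'a'];
        if (!next) next = std::make_unique<TrieNode>();
        node = next.get();
    }
    // Record a line only once, even if the word repeats on it.
    if (node->lines.empty() || node->lines.back() != line) {
        node->lines.push_back(line);
    }
    return true;
}

// Returns the lines recorded for `word`, or an empty vector if the word
// was never inserted (including when it is only a prefix of other words).
std::vector<int> FindWords(const TrieNode& root, const std::string& word) {
    const TrieNode* node = &root;
    for (char c : word) {
        if (c < 'a' || c > 'z') return {};
        node = node->child[c - 'a'].get();
        if (node == nullptr) return {};
    }
    return node->lines;
}
```

A lookup walks one node per character of the word, so its cost depends on the word's length rather than on the number of distinct words in the file.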

Answer 2

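You could also use std::multimap or, better, std::unordered_multimap, since you don't need to iterate over the whole map, only over the elements stored under a given key.

A simple example, modified: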

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_multimap<std::string, int> mymap;
    mymap.insert(std::pair<std::string, int>("word", 1));
    mymap.insert(std::pair<std::string, int>("anotherword", 2));
    mymap.insert(std::pair<std::string, int>("word", 10));

    // equal_range yields every entry stored under the key "word"
    auto range = mymap.equal_range("word");
    for (auto it = range.first; it != range.second; ++it) {
        std::cout << it->second << std::endl;
    }
}
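
Applied to the question's FindWords, the same idea might look like the sketch below. BuildIndex and the WordIndex alias are hypothetical names, and it assumes whitespace-separated words and 0-based line numbers. Because the order of entries within one key of an unordered_multimap is unspecified, the result is sorted before it is returned.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using WordIndex = std::unordered_multimap<std::string, int>;

// Hypothetical helper: stores one (word, line) entry per occurrence.
WordIndex BuildIndex(std::istream& in) {
    WordIndex index;
    std::string line;
    for (int lineNo = 0; std::getline(in, line); ++lineNo) {
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            index.emplace(word, lineNo);
        }
    }
    return index;
}

// Collects the lines stored under `word`, sorted and without repeats.
std::vector<int> FindWords(const WordIndex& index, const std::string& word) {
    std::vector<int> lines;
    auto range = index.equal_range(word);
    for (auto it = range.first; it != range.second; ++it) {
        lines.push_back(it->second);
    }
    // A word that repeats on one line produces duplicate entries; drop them.
    std::sort(lines.begin(), lines.end());
    lines.erase(std::unique(lines.begin(), lines.end()), lines.end());
    return lines;
}
```

The sort and de-duplication on every lookup is the price of keeping one flat entry per occurrence.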

Answer 3

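When you are searching for a single string, the Boyer-Moore string search algorithm is faster than a trie. It can most likely be adapted to handle multiple strings.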

C++: algorithm for finding which lines of a large file contain a word?
