Question

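I'm reading through several documents and indexing the words I read. However, I want to ignore common words (a, an, the, and, is, or, etc.).

Is there a shortcut for doing this? Something better than just doing...

if (word == "and" || word == "is" || etc etc ....) ignore word;

For example, can I put them in a const string somehow and just check against that string? Not sure... thanks!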

Answer 1

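Create a set<string> containing the words you want to exclude, then use mySet.count(word) to determine whether the word is in the set. If it is, the count will be 1; otherwise it will be 0.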

#include <iostream>
#include <set>
#include <string>
using namespace std;

int main() {
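    const char *words[] = {"a", "an", "the"};
    // Build the set from the array; words + 3 is one past the last element.
    set<string> wordSet(words, words + 3);
    cerr << wordSet.count("the") << endl;   // prints 1: "the" is in the set
    cerr << wordSet.count("quick") << endl; // prints 0: "quick" is not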
    return 0;
}
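
To tie this back to the indexing in the question, here is a minimal sketch of a word counter that skips anything found in the set. The name indexWords and the choice of a map<string, int> as the index are assumptions made for illustration, not something from the question; it also splits on whitespace only, so punctuation and capitalization are not handled.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical helper: counts each word in `text`, skipping any word
// that appears in `stopWords`. Words are split on whitespace only.
std::map<std::string, int> indexWords(const std::string& text,
                                      const std::set<std::string>& stopWords) {
    std::map<std::string, int> index;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        if (stopWords.count(word)) {  // 1 if present, 0 otherwise
            continue;                 // common word: ignore it
        }
        ++index[word];
    }
    return index;
}
```

Calling indexWords("the quick fox", stopWords) with the set above would give an index holding "quick" and "fox" but not "the".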

Answer 2

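You can use an array of strings and loop over it, comparing against each one, or use a more suitable data structure such as a set or a trie.

Here is an example of how to do it with a plain array: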

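// Assumes <cstring> is included and word is a const char*.
const char *commonWords[] = {"and", "is" /* ... */};
int commonWordsLength = 2; // number of words in the array

for (int i = 0; i < commonWordsLength; ++i)
{
    // strcmp returns 0 when the two strings are equal
    if (!strcmp(word, commonWords[i]))
    {
        //ignore word;
        break;
    }
}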

Note that this example doesn't use the C++ STL, but you should.
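
For comparison, a rough STL version of the same linear search might look like the sketch below. The function name isCommonWord is hypothetical, and the word list is just the two words shown in the array above; it needs C++11 for the initializer list.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: linear search over a small list of common words,
// using std::string and std::find instead of strcmp and a manual loop.
bool isCommonWord(const std::string& word) {
    static const std::vector<std::string> commonWords = {"and", "is"};
    return std::find(commonWords.begin(), commonWords.end(), word)
           != commonWords.end();
}
```

This is still O(n) per lookup, so it only makes sense for a short list; a set avoids the scan.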

Answer 3

If you want to maximize performance, you should create a trie....

http://en.wikipedia.org/wiki/Trie

......of the stop words......

http://en.wikipedia.org/wiki/Stop_words

There is no standard C++ trie data structure, but see this question for third-party implementations...

Trie implementation
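
To give a feel for what such a structure does, here is a very small illustrative trie. It is a sketch only, not one of the third-party implementations referred to above and not tuned for speed; the class name StopWordTrie and its members are made up for this example, and it needs C++11.

```cpp
#include <map>
#include <memory>
#include <string>

// Minimal illustrative trie: each node maps a character to a child node
// and records whether a stored word ends at that node.
class StopWordTrie {
    struct Node {
        bool isWord = false;
        std::map<char, std::unique_ptr<Node>> children;
    };
    Node root_;

public:
    // Walk down the trie one character at a time, creating nodes as needed.
    void insert(const std::string& word) {
        Node* node = &root_;
        for (char c : word) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) {
                child.reset(new Node());
            }
            node = child.get();
        }
        node->isWord = true;
    }

    // True only if the exact word was inserted (a mere prefix does not count).
    bool contains(const std::string& word) const {
        const Node* node = &root_;
        for (char c : word) {
            auto it = node->children.find(c);
            if (it == node->children.end()) {
                return false;
            }
            node = it->second.get();
        }
        return node->isWord;
    }
};
```

After inserting "the", contains("the") is true while contains("th") is false, because only complete words are marked.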

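If you can't be bothered with that and want to use a standard container, the best one to use is unordered_set<string>, which will put the stop words in a hash table.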

// Returns true when word is NOT a stop word (i.e. it should be kept).
bool filter(const string& word)
{
    static unordered_set<string> stopwords({"a", "an", "the"});
    return !stopwords.count(word);
}
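
One possible way to apply a predicate like this to a whole list of words is std::copy_if, sketched below. The names keepWord and removeStopWords are hypothetical; keepWord simply repeats the logic of filter() so that the block compiles on its own.

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <unordered_set>
#include <vector>

// Same idea as filter(): true when `word` is not a stop word.
bool keepWord(const std::string& word) {
    static const std::unordered_set<std::string> stopwords = {"a", "an", "the"};
    return stopwords.count(word) == 0;
}

// Hypothetical helper: returns a copy of `words` without the stop words,
// preserving the original order.
std::vector<std::string> removeStopWords(const std::vector<std::string>& words) {
    std::vector<std::string> kept;
    std::copy_if(words.begin(), words.end(), std::back_inserter(kept), keepWord);
    return kept;
}
```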

Ignoring several different words.. C++?