Question

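I am doing some text mining with the tm package. I have an ordered word list containing more than 50,000 words. My corpus contains about 2 million words, and I have put all of them into a single document.

To save some memory and to be able to get n-grams (2-grams and 3-grams) for more terms, I would like to replace the words in the corpus with numbers. I see two ways to do this.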

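1) For each word in my ordered word list, find every position where it occurs in the corpus and replace it with the number I want. This means scanning my document 50,000 times and checking all 2 million words each time. That would be 100 billion comparisons.

2) For each of the 2 million words in the corpus, look it up in my list of 50,000 words. Using binary search, I should find a word in the list in at most 16 attempts. That means I only need about 32 million comparisons.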
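
The two estimates above can be checked by writing the arithmetic out. This assumes the usual worst case for a binary search over $n$ sorted items, $\lfloor \log_2 n \rfloor + 1$ comparisons:

\[
\begin{aligned}
\text{approach 1:}\quad & 50{,}000 \times 2{,}000{,}000 = 10^{11} \ \text{comparisons (100 billion)} \\
\text{approach 2:}\quad & \lfloor \log_2 50{,}000 \rfloor + 1 = 15 + 1 = 16 \ \text{comparisons per word} \\
& 2{,}000{,}000 \times 16 = 3.2 \times 10^{7} \ \text{comparisons (32 million)}
\end{aligned}
\]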

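I have searched SO and Google, and found some code and suggestions for C and C++. I could implement a binary text search myself without trouble, but I would prefer to use an existing package or function, ideally one that supports parallel processing.

Any suggestions?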

Answer 1

You can use a set data structure for this; it fits your application well. Alternatively, you can use a standard binary search:

Set Data Structure in C++

Binary Search in C++
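
For the binary-search route, a minimal sketch using `std::lower_bound` from the standard library might look like the following. It assumes the word list is already sorted and that "replace the word with a number" means using the word's position in that list, with -1 for words that are not in the list. The function name `encode` and the -1 convention are illustrative choices, not something from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: maps every corpus word to its position in the
// sorted word list `dict`, or to -1 if the word is not in the list.
std::vector<int> encode(const std::vector<std::string>& dict,
                        const std::vector<std::string>& corpus)
{
    // std::lower_bound only gives correct results on a sorted range.
    assert(std::is_sorted(dict.begin(), dict.end()));

    std::vector<int> ids;
    ids.reserve(corpus.size());
    for (const std::string& word : corpus) {
        // Binary search: first element that is not less than `word`.
        auto it = std::lower_bound(dict.begin(), dict.end(), word);
        bool found = (it != dict.end() && *it == word);
        ids.push_back(found ? static_cast<int>(it - dict.begin()) : -1);
    }
    return ids;
}
```

Compared with `std::set`, this keeps the words in one contiguous sorted array and gives you the integer position directly, which is what you need if the goal is to store numbers instead of words.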

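With the code below you can create as many threads as you need; in my experience, though, 32 million operations in C++ take less than 1 second on most systems. C++ is very fast.

Here the range of 2 million words is split between two threads; you can divide it differently as needed.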

#include <string>
#include <iostream>
#include <thread>
#include <set>

using namespace std;

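set <string> s; // std::set lookups are O(log n), so no hand-written binary search is needed
string corpus[2*1000*1000]; // the corpus words (must be filled before the threads start)
string dict[50000];         // the word list (must be filled before it is inserted into s)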

// The function we want to execute on the new thread.
void func(int start, int end)
{
    for(int i = start; i < end; i++){ // Walk over this thread's slice of the corpus: [start, end)
        //============ YOU CAN ALSO WRITE YOUR BINARY SEARCH HERE ================

        // If you want you
        if(s.find(corpus[i]) != s.end()){
            //the word is in the 50000 dictionary
        }
        else {
            //It is not
        }
    }
}

int main()
{

    for(int i = 0; i < 50000; i++){
        s.insert(dict[i]);
    }

    // Constructs the new thread and runs it. Does not block execution.

    //=========== ADD AS MANY THREADS AS YOU WANT ======================
    thread t1(func, 0, 1000000); // ranges are half-open, so start at 0 to include corpus[0]
    thread t2(func, 1000000, 2000000);

    // Makes the main thread wait for the new thread to finish execution, therefore blocks its own execution.
    t1.join();
    t2.join();
}
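
If you want the integer codes themselves and an arbitrary number of threads, the same idea can be sketched as below. This is only an illustration under a few assumptions: the word list is sorted, a word's code is its position in the list, and -1 marks words that are not in the list. The names `encode_range` and `encode_parallel` are made up for this example. Each thread writes to its own disjoint slice of the output, so no locking is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical worker: fills ids[begin, end) with the dictionary
// positions of corpus[begin, end), using -1 for unknown words.
void encode_range(const std::vector<std::string>& dict,
                  const std::vector<std::string>& corpus,
                  std::vector<int>& ids,
                  std::size_t begin, std::size_t end)
{
    for (std::size_t i = begin; i < end; ++i) {
        auto it = std::lower_bound(dict.begin(), dict.end(), corpus[i]);
        bool found = (it != dict.end() && *it == corpus[i]);
        ids[i] = found ? static_cast<int>(it - dict.begin()) : -1;
    }
}

// Hypothetical driver: splits the corpus into roughly equal chunks,
// one per thread, and waits for all of them to finish.
std::vector<int> encode_parallel(const std::vector<std::string>& dict,
                                 const std::vector<std::string>& corpus,
                                 unsigned num_threads)
{
    assert(std::is_sorted(dict.begin(), dict.end()));
    if (num_threads == 0) num_threads = 1;

    const std::size_t n = corpus.size();
    const std::size_t chunk = (n + num_threads - 1) / num_threads; // ceiling division
    std::vector<int> ids(n, -1);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = std::min(n, t * chunk);
        std::size_t end = std::min(n, begin + chunk);
        if (begin == end) break; // nothing left to hand out
        workers.emplace_back(encode_range, std::cref(dict), std::cref(corpus),
                             std::ref(ids), begin, end);
    }
    for (std::thread& w : workers) w.join();
    return ids;
}
```

A natural choice for `num_threads` is `std::thread::hardware_concurrency()`, which is allowed to return 0 when the value cannot be determined; that is why the sketch falls back to a single thread in that case.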

Is there a standard function for binary search on an ordered word list?
