How do you speed up computing bigrams / trigrams over a large number (~1 million) of documents in MongoDB?

Time: 2017-04-20 13:11:22

Tags: python mongodb apache-spark pyspark pymongo

I have roughly one million documents in MongoDB, each with a large text field, and I want to extract the most meaningful terms from them. My current approach counts bigrams per week in Python, using logic similar to the code below. The problem is that this logic is slow. Is there a faster way to do this?

from collections import Counter

from nltk import ngrams
from nltk.tokenize import sent_tokenize

for week_start, week_end in zip(weeks[:-1], weeks[1:]):
    all_top_words = Counter()
    for post in collection.find({'date': {'$lt': week_end, '$gte': week_start},
                                 'text': {'$exists': True}}):
        # strip_tags, remove_brackets and remove_punctuation are custom helpers
        text = strip_tags(post['text'])
        text = remove_brackets(text)
        sentences = sent_tokenize(text)
        for sentence in sentences:
            sentence = sentence.lower()
            sentence = remove_punctuation(sentence)
            top_words = Counter(ngrams(sentence.split(" "), 2))
            all_top_words += top_words

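Below is a minimal, hedged sketch of one way the same weekly counting could be made faster: fetch only the text field via a projection and process the weekly windows in parallel with multiprocessing. It is illustrative only and not from the original post; the connection URI, database/collection names, date range, and the text-cleaning helpers are placeholders standing in for the question's own setup.

import re
import string
from collections import Counter
from datetime import datetime, timedelta
from multiprocessing import Pool

from nltk import ngrams
from nltk.tokenize import sent_tokenize
from pymongo import MongoClient

# Placeholder stand-ins for the question's own cleaning helpers.
def strip_tags(text):
    return re.sub(r'<[^>]+>', ' ', text)

def remove_brackets(text):
    return re.sub(r'\[[^\]]*\]', ' ', text)

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def count_week(bounds):
    """Count bigrams for a single (week_start, week_end) window."""
    week_start, week_end = bounds
    # Each worker opens its own connection; a MongoClient should not be shared across forks.
    client = MongoClient('mongodb://localhost:27017')  # placeholder URI
    collection = client.mydb.posts                     # placeholder db/collection names
    counts = Counter()
    cursor = collection.find(
        {'date': {'$lt': week_end, '$gte': week_start}, 'text': {'$exists': True}},
        projection={'text': 1, '_id': 0},  # only ship the text field over the wire
    )
    for post in cursor:
        text = remove_brackets(strip_tags(post['text']))
        for sentence in sent_tokenize(text):
            sentence = remove_punctuation(sentence.lower())
            counts.update(ngrams(sentence.split(), 2))
    client.close()
    return counts

if __name__ == '__main__':
    start = datetime(2017, 1, 1)                              # placeholder date range
    weeks = [start + timedelta(weeks=i) for i in range(13)]
    with Pool() as pool:
        weekly_counts = pool.map(count_week, list(zip(weeks[:-1], weeks[1:])))

An index on the date field would also typically be needed so that each weekly query does not scan the whole collection.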
0 Answers:

No answers yet.