I'm looking for a faster alternative to NLTK for analyzing large corpora and doing basic things like computing frequencies, PoS tagging, etc. spaCy looks great and is easy to use in many ways, but I can't find any built-in function for, e.g., counting the frequency of a specific word. I've looked through the spaCy documentation, but I can't find a straightforward way to do this. Am I missing something?
What I'd like is the spaCy equivalent of NLTK's:
tokens.count("word")  # where tokens is the tokenized text in which the word is to be counted
In NLTK, the code above tells me that "word" occurs X times in my text.
Note that I've come across the count_by function, but it doesn't seem to do what I want.
Answer 0 (score: 1)
The Python stdlib includes collections.Counter for this kind of purpose. You haven't said enough about your situation for me to know whether it fits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())  # count every whitespace-separated token
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
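For reference, Counter also offers most_common, which is handy when you want the top-n tokens rather than one specific word:
print(freq.most_common(3))
>>> [('the', 6), ('Lorem', 4), ('of', 4)]
As for speed, here is a quick benchmark counting a 100-million-word file: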
import random, timeit
from collections import Counter

def loadWords():
    # write 100 million random words to a test corpus
    with open('corpora.txt', 'w') as corpora:
        randWords = ['foo', 'bar', 'life', 'car', 'wrong',
                     'right', 'left', 'plain', 'random', 'the']
        for i in range(100000000):
            corpora.write(randWords[random.randint(0, 9)] + " ")

def countWords():
    # read the corpus back in and count every token
    with open('corpora.txt', 'r') as corpora:
        content = corpora.read()
        myDict = Counter(content.split())
        print("foo: ", myDict['foo'])

print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
The results (the loadWords time in seconds, then countWords' output and its time):
149.01646934738716
foo: 9998872
18.093295297389773
I'm still not sure whether this will be fast enough for you, though.
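One caveat with countWords: content = corpora.read() loads the entire file into memory. For a corpus stored one document per line (the corpora.txt generated above is a single long line, so it would not benefit), a streamed variant keeps memory usage flat. A minimal sketch:
from collections import Counter

def count_words_streamed(path):
    # update the Counter one line at a time instead of
    # reading the whole file into memory first
    freq = Counter()
    with open(path, 'r') as corpora:
        for line in corpora:
            freq.update(line.split())
    return freq

print("foo:", count_words_streamed('corpora.txt')['foo'])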
Answer 1 (score: 1)
I use spaCy for frequency counts in corpora all the time. This is typically what I do:
import spacy
nlp = spacy.load("en_core_web_sm")

list_of_words = ['run', 'jump', 'catch']

def word_count(string):
    words_counted = 0
    my_string = nlp(string)
    for token in my_string:
        # actual word
        word = token.text
        # lemma
        lemma_word = token.lemma_
        # part of speech
        word_pos = token.pos_
        if lemma_word in list_of_words:
            words_counted += 1
            print(lemma_word)
    return words_counted
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
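On the built-in the question mentions: spaCy does have Doc.count_by, but it aggregates over a token attribute and returns hash IDs as keys, which is probably why it looks like it "doesn't do what I want" at first glance. Mapping the hashes back through the vocab gives per-word counts. A minimal sketch, assuming the same en_core_web_sm model as above:
import spacy
from spacy.attrs import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("I ran, jumped, and caught the ball.")

# count_by returns {hash_id: count}; the keys are hashes of the
# token texts, not the strings themselves
counts = doc.count_by(ORTH)
readable = {nlp.vocab.strings[key]: value for key, value in counts.items()}
print(readable.get("the", 0))  # 1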