I'm looking for a faster alternative to NLTK for analyzing large corpora and doing basic things like computing frequencies, PoS tagging, etc. spaCy looks great and is easy to use in many ways, but I can't find any built-in function for, e.g., counting the frequency of a specific word. I've looked through the spaCy documentation, but I can't find a straightforward way to do this. Am I missing something?
What I'd like is the spaCy equivalent of NLTK's:
tokens.count("word")  # where tokens is the tokenized text in which the word is to be counted
In NLTK, the code above tells me that "word" occurs X times in my text.
Note that I've come across the count_by function, but it doesn't seem to do what I want.
Answer 0 (score: 1)
The Python stdlib includes collections.Counter for this kind of purpose. You haven't said enough about your situation for me to know whether it fits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())  # count every whitespace-separated token
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
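For reference, Counter also offers most_common, which is handy when you want the top-n tokens rather than one specific word:
print(freq.most_common(3))
>>> [('the', 6), ('Lorem', 4), ('of', 4)]
As for speed, here is a quick benchmark counting a 100-million-word file: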
import random, timeit
from collections import Counter

def loadWords():
    # write 100 million random words to a test corpus
    with open('corpora.txt', 'w') as corpora:
        randWords = ['foo', 'bar', 'life', 'car', 'wrong',
                     'right', 'left', 'plain', 'random', 'the']
        for i in range(100000000):
            corpora.write(randWords[random.randint(0, 9)] + " ")

def countWords():
    # read the corpus back in and count every token
    with open('corpora.txt', 'r') as corpora:
        content = corpora.read()
        myDict = Counter(content.split())
        print("foo: ", myDict['foo'])

print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
The results (the loadWords time in seconds, then countWords' output and its time):
149.01646934738716
foo: 9998872
18.093295297389773
I'm still not sure whether this will be fast enough for you, though.
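One caveat with countWords: content = corpora.read() loads the entire file into memory. For a corpus stored one document per line (the corpora.txt generated above is a single long line, so it would not benefit), a streamed variant keeps memory usage flat. A minimal sketch:
from collections import Counter

def count_words_streamed(path):
    # update the Counter one line at a time instead of
    # reading the whole file into memory first
    freq = Counter()
    with open(path, 'r') as corpora:
        for line in corpora:
            freq.update(line.split())
    return freq

print("foo:", count_words_streamed('corpora.txt')['foo'])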
Answer 1 (score: 1)
I use spaCy for frequency counts in corpora all the time. This is typically what I do:
import spacy
nlp = spacy.load("en_core_web_sm")

list_of_words = ['run', 'jump', 'catch']

def word_count(string):
    words_counted = 0
    my_string = nlp(string)
    for token in my_string:
        # actual word
        word = token.text
        # lemma
        lemma_word = token.lemma_
        # part of speech
        word_pos = token.pos_
        if lemma_word in list_of_words:
            words_counted += 1
            print(lemma_word)
    return words_counted
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
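On the built-in the question mentions: spaCy does have Doc.count_by, but it aggregates over a token attribute and returns hash IDs as keys, which is probably why it looks like it "doesn't do what I want" at first glance. Mapping the hashes back through the vocab gives per-word counts. A minimal sketch, assuming the same en_core_web_sm model as above:
import spacy
from spacy.attrs import ORTH

nlp = spacy.load("en_core_web_sm")
doc = nlp("I ran, jumped, and caught the ball.")

# count_by returns {hash_id: count}; the keys are hashes of the
# token texts, not the strings themselves
counts = doc.count_by(ORTH)
readable = {nlp.vocab.strings[key]: value for key, value in counts.items()}
print(readable.get("the", 0))  # 1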