I'm trying to find the number of times all the words from a word list occur in conversations. Individual word frequencies don't matter, only the total count. The word list includes n-grams up to length 3.
import pandas as pd
import numpy as np
from nltk.util import ngrams

find = ['car', 'motor cycle', 'heavy traffic vehicle']
data = pd.read_csv('inputdata.csv')
def count_words(doc, find):
    # build every 1-, 2- and 3-gram of the document as a space-joined string
    onegrams = [' '.join(grams) for grams in ngrams(doc.split(), 1)]
    bigrams = [' '.join(grams) for grams in ngrams(doc.split(), 2)]
    trigrams = [' '.join(grams) for grams in ngrams(doc.split(), 3)]
    n_gram = onegrams + bigrams + trigrams
    # get count of the unique search terms present in the doc
    lst = ".".join([i for i in find if i in n_gram])
    cnt = np.count_nonzero(np.unique(lst.split(".")))
    return cnt
result = data['text'].apply(lambda x: count_words(x, find))
This process is quite heavy and takes very long to run on a large dataset. What are the options for optimizing the existing approach, or is there an alternative way to do this?
Answer 0 (score: 0)
First, split the document once instead of three times on every call:
def count_words(doc, find):
    word_list = doc.split()
    onegrams = [' '.join(grams) for grams in ngrams(word_list, 1)]
    ...
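For reference, a sketch of the whole function with only that change applied; the rest of the body is carried over unchanged from the question's code:

def count_words(doc, find):
    # split once and reuse the word list for every n-gram size
    word_list = doc.split()
    onegrams = [' '.join(grams) for grams in ngrams(word_list, 1)]
    bigrams = [' '.join(grams) for grams in ngrams(word_list, 2)]
    trigrams = [' '.join(grams) for grams in ngrams(word_list, 3)]
    n_gram = onegrams + bigrams + trigrams
    lst = ".".join([i for i in find if i in n_gram])
    cnt = np.count_nonzero(np.unique(lst.split(".")))
    return cnt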
Second, the Counter class from the collections module is very good at counting. With it, the counting step in your code becomes trivial, and about as fast as pure Python gets.
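Putting both suggestions together, a minimal sketch of what that could look like (an illustration, not the answerer's exact code; note that it counts every occurrence of each search term, which matches the total count the question asks for, whereas the original membership test counted each matched term at most once):

from collections import Counter

def count_words(doc, find):
    word_list = doc.split()
    # count every 1-, 2- and 3-gram of the document in one pass
    counts = Counter(
        ' '.join(grams)
        for n in (1, 2, 3)
        for grams in ngrams(word_list, n)
    )
    # sum the occurrences of every search term; Counter returns 0
    # for terms that never appear in the document
    return sum(counts[term] for term in find)

The per-row call from the question, result = data['text'].apply(lambda x: count_words(x, find)), works unchanged on top of this version.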