Question

When counting how often words occur in a text, how can I ignore words like 'a'?


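My current count returns one of these common words as the answer, but I expect "distance" to be the most frequent word.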

Answer 1

It is best to avoid counting those entries in the first place, like this:

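import collections
ignore = {'the', 'a', 'if', 'in', 'it', 'of', 'or'}  # words to leave out of the count
# f is the iterable of words from the question
result = collections.Counter(x for x in f if x not in ignore).most_common(1)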
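
As a self-contained sketch of the same idea, the snippet below counts the words of a made-up sentence while skipping the ignore set. The sample text and the simple regex tokenization are assumptions for illustration; they do not come from the original question.

```python
import collections
import re

# Hypothetical sample text; the question's real data is not shown.
text = "The distance of the run is the distance in a race"

# Lowercase and split into words (a simple, assumed tokenization).
words = re.findall(r"[a-z']+", text.lower())

ignore = {'the', 'a', 'if', 'in', 'it', 'of', 'or'}

# Filter while counting, so ignored words never enter the Counter.
counts = collections.Counter(w for w in words if w not in ignore)
result = counts.most_common(1)
print(result)
```

Filtering inside the generator keeps the Counter small and means most_common(1) never has to be checked against the ignore set afterwards.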

Answer 2

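Another option is to use the stop_words parameter of CountVectorizer. These are words you are not interested in, and they will be discarded by the analyzer.

import collections
from sklearn.feature_extraction.text import CountVectorizer

# Build an analyzer that drops the listed stop words, then apply it to the text
f = CountVectorizer(stop_words=['the', 'a', 'if', 'in', 'it', 'of', 'or']).build_analyzer()(str(df['phrase']))
result = collections.Counter(f).most_common(1)
print(result)
# Output reported in the original answer (Python 2): [(u'distance', 1)]

Note that the tokenizer does not perform preprocessing (lowercasing, accent stripping) or remove stop words, so you need to use the analyzer here.

You can also use stop_words='english' to remove English stop words automatically (see scikit-learn's built-in English stop word list for the full set).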
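
To make the difference between the tokenizer and the analyzer concrete, here is a small sketch on a made-up sentence. The sentence is an assumption for illustration, not the question's data, and the snippet assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample sentence for illustration.
text = "The distance of the run"

vectorizer = CountVectorizer(stop_words='english')

# The tokenizer only splits the text: no lowercasing, no stop word removal.
tokens = vectorizer.build_tokenizer()(text)

# The analyzer lowercases, tokenizes, and then drops stop words.
analyzed = vectorizer.build_analyzer()(text)

print(tokens)
print(analyzed)
```

Counting the tokenizer output would still include "The", "of", and "the", which is why the analyzer is the one to feed into collections.Counter.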
