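I have a corpus of customer reviews and want to identify rare words, which for me are words that appear in less than 1% of the corpus documents.
I already have a working solution, but it is too slow for my script: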
# Review data is a nested list of reviews, each represented as a bag of words
doc_clean = [['This', 'is', 'review', '1'], ['This', 'is', 'review', '2'], ..]
# Save all words of the corpus in a set
all_words = set([w for doc in doc_clean for w in doc])
# Initialize a list for the collection of rare words
rare_words = []
# Loop through all_words to identify rare words
for word in all_words:
    # Count in how many reviews the word appears
    counts = sum([word in set(review) for review in doc_clean])
    # Add word to rare_words if it appears in less than 1% of the reviews
    if counts / len(doc_clean) <= 0.01:
        rare_words.append(word)
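Does anyone know a faster implementation? Iterating over every word in every single review seems to be very time-consuming.
Thanks in advance and best wishes, Markus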
Answer 0: (score: 5)
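This may not be the most efficient solution, but it is easy to understand and maintain, and I often use it myself. I use Counter and pandas:
import pandas as pd
from collections import Counter
Apply Counter to each document and build a term-frequency matrix, with one row per document and one column per word.
Some fields in the matrix are undefined. They correspond to words that do not occur in a particular document. Count the number of documents in which each word occurs, that is, the number of defined fields per column.
Then select the words that occur infrequently.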
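The steps described in this answer could be sketched roughly as follows. This is a minimal illustration and not necessarily the answerer's exact code: the helper name `find_rare_words`, its `threshold` parameter, and the toy corpus are assumptions made for the example. It keeps the `<=` comparison from the question's loop.

```python
from collections import Counter

import pandas as pd

# Hypothetical toy corpus standing in for doc_clean from the question
doc_clean = [
    ["This", "is", "review", "1"],
    ["This", "is", "review", "2"],
    ["This", "is", "review", "3"],
    ["great", "product"],
]


def find_rare_words(docs, threshold=0.01):
    """Return words that occur in at most `threshold` of the documents.

    Hypothetical helper written for this illustration.
    """
    # One row per document, one column per word; words that are absent
    # from a document end up as NaN in that row
    df = pd.DataFrame([Counter(doc) for doc in docs])
    # Document frequency: number of rows with a defined (non-NaN) count
    doc_freq = df.notna().sum()
    # Keep the words whose share of documents is at or below the threshold
    return doc_freq[doc_freq / len(docs) <= threshold].index.tolist()


# The toy corpus has only 4 documents, so a 25% threshold is used here
rare_words = find_rare_words(doc_clean, threshold=0.25)
```

Note that the matrix built this way is dense, so with a large vocabulary and many reviews it may need a lot of memory.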