I have this code:
# Remove words that appear less than X (e.g. 2) time(s)
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 2] for text in texts]
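Does this filter out all tokens whose term frequency (total number of occurrences across all texts) is below 2, or all tokens whose document frequency (number of texts containing the token one or more times) is below 2?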
Edit
# Get term frequencies (how many times a term occurs no matter what)
from collections import Counter
termfrequency = Counter()
for text in texts:
    for token in text:
        termfrequency[token] += 1
texts = [[token for token in text if termfrequency[token] > 2] for text in texts]
# Get document frequencies (in how many documents a term exists > 0 times)
from collections import Counter
documentfrequency = Counter()
for text in texts:
    documentfrequency.update(set(text))
texts = [[token for token in text if documentfrequency[token] > 2] for text in texts]
Answer 0 (score: 0)
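[I want to count] the number of documents in the whole collection in which a word occurs, regardless of how many times it occurs in any particular document.
Here is one way to do it: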
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in set(text):
    #            ^^^ set() only keeps one occurrence of each word
        frequency[token] += 1
texts = [[token for token in text if frequency[token] > 2] for text in texts]
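There is nothing wrong with using defaultdict here. However, it is worth noting that the collections module has a class better suited to the task at hand. It is called Counter: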
from collections import Counter
frequency = Counter()
for text in texts:
    frequency.update(set(text))
texts = [[token for token in text if frequency[token] > 2] for text in texts]
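To make the distinction concrete: the code in the question counts every occurrence, so it filters by term frequency, while the set() variants count each text at most once per token, so they filter by document frequency. Below is a minimal sketch of both counts on a made-up corpus. The sample texts, the names term_frequency and document_frequency, and the helper filter_by_document_frequency with its min_docs parameter are assumptions for illustration only, not part of the original code.

```python
from collections import Counter

# Made-up sample corpus, used only for illustration.
texts = [
    ["apple", "apple", "apple", "banana"],
    ["banana", "cherry"],
    ["banana", "cherry", "cherry"],
]

# Term frequency: every occurrence of a token counts.
term_frequency = Counter(token for text in texts for token in text)

# Document frequency: each text contributes at most 1 per token,
# because set(text) drops repeated tokens within that text.
document_frequency = Counter()
for text in texts:
    document_frequency.update(set(text))


def filter_by_document_frequency(texts, min_docs):
    """Hypothetical helper: keep tokens that occur in at least min_docs texts."""
    df = Counter()
    for text in texts:
        df.update(set(text))
    return [[token for token in text if df[token] >= min_docs] for text in texts]


filtered = filter_by_document_frequency(texts, min_docs=2)
```

In this sample, "apple" has a term frequency of 3 but a document frequency of 1, so it survives a term-frequency filter and is dropped by a document-frequency filter. Note also that a test such as frequency[token] > 2 keeps only tokens counted three or more times; writing the threshold as >= min_docs makes the intended cutoff explicit.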