Question

我有这段代码：

# Remove words that appear less than X (e.g. 2) time(s)
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 2] for text in texts]

现在这会过滤掉所有令牌，其中术语频率（所有文本中的总出现次数）低于2，或者文档频率（带有一次或多次的文本总数）低于2？

修改

# Get term frequencies (how many times a term occurs no matter what)

from collections import Counter
termfrequency = Counter()
for text in texts:
    for token in text:
        termfrequency[token] +=1

texts = [[token for token in text if termfrequency[token] > 2] for text in texts]

# Get document frequencies (in how many documents a term exists > 0 times)

from collections import Counter
documentfrequency = Counter()
for text in texts:
    documentfrequency.update(set(text))

texts = [[token for token in text if documentfrequency[token] > 2] for text in texts]

Answer 1

[我想计算]整个集合中出现一个单词的文档数量，无论它出现在任何特定文档中的次数。

这是一种方法：

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in set(text):
               # ^^^ set() only keeps one occurrence of each word
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 2] for text in texts]

在这里使用defaultdict没有错。但是，值得注意的是collections模块有一个更适合手头任务的类。它被称为Counter：

from collections import Counter
frequency = Counter()
for text in texts:
    frequency.update(set(text))
texts = [[token for token in text if frequency[token] > 2] for text in texts]

按文档频率删除令牌

1 个答案: