Removing tokens by document frequency

Time: 2017-09-03 08:19:11

Tags: python collections

I have this code:

# Remove words that appear less than X (e.g. 2) time(s)
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 2] for text in texts]

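Does this filter out all tokens whose term frequency (total number of occurrences across all texts) is below 2, or all tokens whose document frequency (number of texts containing the token at least once) is below 2?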

Edit

# Get term frequencies (how many times a term occurs no matter what)

from collections import Counter
termfrequency = Counter()
for text in texts:
    for token in text:
        termfrequency[token] += 1

texts = [[token for token in text if termfrequency[token] > 2] for text in texts]

# Get document frequencies (in how many documents a term exists > 0 times)

from collections import Counter
documentfrequency = Counter()
for text in texts:
    documentfrequency.update(set(text))

texts = [[token for token in text if documentfrequency[token] > 2] for text in texts]

1 answer:

Answer 0 (score: 0)

  

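[I want to count] the number of documents in the whole collection in which a word appears, regardless of how many times it appears in any particular document.

The loop in your first snippet increments the count on every occurrence, so it filters by term frequency. To filter by document frequency instead, count each token at most once per text. Here is one way to do it: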

from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in set(text):
               # ^^^ set() only keeps one occurrence of each word
        frequency[token] += 1

texts = [[token for token in text if frequency[token] > 2] for text in texts]
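
To see the difference on concrete data, here is a minimal sketch that computes both counts side by side. The toy corpus and the names term_frequency, document_frequency, by_tf and by_df are made up for illustration and are not from the question.

```python
from collections import defaultdict

# Hypothetical toy corpus (illustrative only, not from the question).
texts = [
    ["apple", "apple", "apple", "banana"],
    ["banana", "cherry"],
    ["banana", "cherry", "cherry"],
]

term_frequency = defaultdict(int)      # counts every occurrence
document_frequency = defaultdict(int)  # counts each text at most once

for text in texts:
    for token in text:
        term_frequency[token] += 1
    for token in set(text):
        document_frequency[token] += 1

# Same threshold as in the question: keep tokens whose count is > 2.
by_tf = [[t for t in text if term_frequency[t] > 2] for text in texts]
by_df = [[t for t in text if document_frequency[t] > 2] for text in texts]

print(dict(term_frequency))
print(dict(document_frequency))
print(by_tf)
print(by_df)
```

In this corpus "apple" occurs three times but only in one text, so it survives the term-frequency filter and is dropped by the document-frequency filter.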

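There is nothing wrong with using defaultdict here. However, it is worth noting that the collections module has a class better suited to the task at hand. It is called Counter: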

from collections import Counter
frequency = Counter()
for text in texts:
    frequency.update(set(text))
texts = [[token for token in text if frequency[token] > 2] for text in texts]
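
If you need this in more than one place, the same idea can be wrapped in a small helper. This is only a sketch: the function name filter_by_document_frequency and the min_df parameter are hypothetical. It also assumes you want to keep tokens that appear in at least min_df texts, which matches the comment in the question ("less than X (e.g. 2)"); note that the `> 2` comparison used in the snippets above would additionally drop tokens that appear in exactly two texts.

```python
from collections import Counter


def filter_by_document_frequency(texts, min_df=2):
    """Keep only tokens that occur in at least `min_df` texts.

    Hypothetical helper; `min_df` is an assumed parameter name.
    """
    # set(text) collapses repeats, so each text contributes at most 1 per token.
    document_frequency = Counter(
        token for text in texts for token in set(text)
    )
    return [
        [token for token in text if document_frequency[token] >= min_df]
        for text in texts
    ]


# Made-up example data.
texts = [
    ["apple", "apple", "banana"],
    ["banana", "cherry"],
    ["cherry", "durian"],
]
filtered = filter_by_document_frequency(texts, min_df=2)
print(filtered)
```

Here "apple" is dropped even though it occurs twice, because both occurrences are in the same text.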