Question

import gensim
corpus = [["a","b","c"],["a","d","e"],["a","f","g"]]
from gensim.corpora import Dictionary
dct = Dictionary(corpus)
print(dct)
dct.filter_extremes(no_below=1)
print(dct)

When I run the code above, my output is:

Dictionary(7 unique tokens: ['a', 'b', 'c', 'd', 'e']...)
Dictionary(6 unique tokens: ['b', 'c', 'd', 'e', 'f']...)

I assumed that since "a" appears in two documents, it would not be removed. However, that is not the case. Am I missing something?

Answer 1

See the documentation of filter_extremes:

filter_extremes(no_below=5, no_above=0.5, keep_n=100000, keep_tokens=None)

Notes:    
This removes all tokens in the dictionary that are:

    1. Less frequent than no_below documents (absolute number, e.g. 5) or
    2. More frequent than no_above documents (fraction of the total corpus size, e.g. 0.3).
    3. After (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n=None).

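You only passed no_below=1. That means tokens appearing in fewer than 1 document (out of 3) are removed. By that rule alone, a and every other token in your corpus would be kept.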

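But look at the default no_above=0.5, which applies because you did not pass an explicit value for that keyword. It means tokens that appear in more than 50% of the documents are removed (with 3 documents, that is any token appearing in at least 2 of them). 'a' appears in all 3 documents, and it is in fact the only token that appears in at least 2. That is why this token, and only this token, is removed from the result. (In your example, the default keep_n value of 100000 makes step 3 a no-op.)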
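To make the arithmetic concrete, here is a minimal pure-Python sketch of rules (1) and (2) applied to the corpus from the question. The helper `surviving_tokens` is a hypothetical name invented for illustration; it is not gensim's implementation and only mirrors the rules as quoted above.

```python
from collections import Counter

corpus = [["a", "b", "c"], ["a", "d", "e"], ["a", "f", "g"]]

# Document frequency: the number of documents each token appears in.
doc_freq = Counter(token for doc in corpus for token in set(doc))


def surviving_tokens(doc_freq, num_docs, no_below=5, no_above=0.5):
    """Hypothetical helper: keep tokens that pass rules (1) and (2)."""
    # Turn the fractional no_above into a document count (3 * 0.5 = 1.5 here).
    max_docs = no_above * num_docs
    return sorted(t for t, df in doc_freq.items() if no_below <= df <= max_docs)


# Same arguments as the question: no_below=1, no_above left at its default.
kept_default = surviving_tokens(doc_freq, len(corpus), no_below=1)
print(doc_freq["a"])   # 3
print(kept_default)    # ['b', 'c', 'd', 'e', 'f', 'g']
```

Every token except 'a' has a document frequency of 1, which sits between no_below=1 and the 1.5-document ceiling, so only 'a' is dropped.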

If you only want to strip the low-frequency tokens, pass an explicit no_above=1.0 to filter_extremes.

Misunderstanding about the use of filter_extremes in gensim
