Question

Edit: this is the question I ultimately ended up asking: Understanding min_df and max_df in scikit CountVectorizer

I am reading the scikit-learn CountVectorizer documentation and noticed that, in the discussion of max_df, what matters is a token's document frequency:

max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.

But when it comes to max_features, what matters is term frequency:

max_features : int or None, default=None
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

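I'm confused: if we use max_df and set it to, say, 10, aren't we saying "ignore any token that appears more than 10 times"?

And if we set max_features to 100, aren't we saying "use only the 100 tokens that appear most often across the whole corpus"?

If I have that right... then what is the difference between the wording 'term frequency' and 'document frequency'?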

Answer 1

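When you set max_df to 10, you are saying "ignore any token that appears in more than 10 documents". Here you do not consider how many times the token occurs within each document, only the number of documents it appears in.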
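To make the document-frequency idea concrete, here is a minimal pure-Python sketch. The toy corpus and whitespace tokenization are assumptions for illustration only; CountVectorizer uses its own tokenizer, so this mirrors the idea rather than the library's implementation.

```python
from collections import Counter

# Hypothetical toy corpus, split on whitespace for simplicity.
docs = [
    "apple apple apple apple banana",
    "banana cherry",
    "banana cherry date",
]

# Document frequency: each token counts at most once per document,
# so the four occurrences of "apple" in the first document count as 1.
doc_freq = Counter()
for doc in docs:
    doc_freq.update(set(doc.split()))

max_df = 2  # an absolute document count, as when max_df is an int

# Keep tokens whose document frequency is not strictly higher than max_df.
kept = sorted(tok for tok, df in doc_freq.items() if df <= max_df)

print(sorted(doc_freq.items()))
print(kept)  # ['apple', 'cherry', 'date']
```

Note that "apple" survives even though it occurs four times, because it appears in only one document, while "banana" is dropped because it appears in all three.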

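When you set max_features to 100, it means "sort the tokens in descending order by term frequency across the corpus (that is, the total number of times each token occurs, summed over every document in the corpus), then keep only the top 100 of those tokens".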
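The two parameters can be compared side by side on a small example. This is a sketch, assuming a made-up three-document corpus and otherwise default CountVectorizer settings; the values 2 and 2 are chosen only to make the contrast visible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus:
#   term frequency:     apple=4, banana=3, cherry=2, date=1
#   document frequency: apple=1, banana=3, cherry=2, date=1
docs = [
    "apple apple apple apple banana",
    "banana cherry",
    "banana cherry date",
]

# max_df=2: drop terms that appear in more than 2 documents.
by_df = CountVectorizer(max_df=2).fit(docs)

# max_features=2: keep the 2 terms with the highest total count.
by_tf = CountVectorizer(max_features=2).fit(docs)

vocab_max_df = sorted(by_df.vocabulary_)
vocab_max_features = sorted(by_tf.vocabulary_)

print(vocab_max_df)        # ['apple', 'cherry', 'date']
print(vocab_max_features)  # ['apple', 'banana']
```

"banana" is removed by max_df because it occurs in every document, yet it is kept by max_features because its total count is the second highest.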
