
Are the max_df and min_df limits in CountVectorizer applied before or after stop-word removal and bigram extraction?

Time: 2018-08-29 21:43:23

Tags: scikit-learn nltk n-gram stop-words countvectorizer

I am new to nltk. I am trying to understand the order in which the various CountVectorizer parameters are applied:

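Tokenization - for example, a custom tokenizer that removes words shorter than 3 characters. By default, CountVectorizer allows words with hyphens and underscores, e.g. August 2015, GPA 3.9, etc.
Case handling (lowercasing)
Stop-word removal
Removing words by document frequency - max_df and min_df
Finding bigrams
Stemming - if added as part of a custom tokenizer definition, or through the analyzer as described in this post: https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn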
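
One way to probe the ordering of the stop-word, bigram, and document-frequency steps listed above is a small experiment like the sketch below. It assumes scikit-learn is installed; the two-sentence corpus and the variable names (`docs`, `stop_vec`, `df_vec`) are made up for illustration. If stop words are removed before bigrams are built, the bigram "quick fox" should appear once "red" and "blue" are stop words. If `min_df` were applied before bigrams, pruning the rare words "red" and "blue" would also produce "quick fox"; if it is applied afterwards, that bigram never exists.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "red" and "blue" each occur in only one document.
docs = ["Quick red fox", "Quick blue fox"]

# Case 1: "red" and "blue" are stop words.
# The bigram "quick fox" shows up, so stop words are dropped
# before the n-grams are assembled.
stop_vec = CountVectorizer(stop_words=["red", "blue"], ngram_range=(1, 2))
stop_vec.fit(docs)
stop_vocab = sorted(stop_vec.vocabulary_)

# Case 2: no stop words, but min_df=2 (a term must occur in both documents).
# "quick fox" does not show up: the bigrams were built from the full token
# stream ("quick red", "red fox", ...) and only then pruned by min_df.
df_vec = CountVectorizer(min_df=2, ngram_range=(1, 2))
df_vec.fit(docs)
df_vocab = sorted(df_vec.vocabulary_)

print(stop_vocab)  # ['fox', 'quick', 'quick fox']
print(df_vocab)    # ['fox', 'quick']
```

On this toy input the result suggests the order is lowercasing, tokenization, stop-word removal, n-gram construction, and finally min_df/max_df filtering over all resulting features, unigrams and bigrams alike. A stemmer placed in a custom tokenizer would run at the tokenization step, before stop-word removal.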

0 answers:

No answers yet