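I want to compute term frequencies over an entire corpus. I have found two ways to do this. The first uses CountVectorizer and then sums along axis=0, as shown below.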

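from sklearn.feature_extraction.text import CountVectorizer

count_vec = CountVectorizer(tokenizer=cab_tokenizer, ngram_range=(1,2), stop_words=stopwords)
cv_X = count_vec.fit_transform(string_list)
term_counts = cv_X.sum(axis=0)  # total count of each term over all documents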
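
To make the first approach concrete, here is a minimal sketch of turning the summed matrix into a term-to-count dict. The toy corpus is hypothetical, and it assumes the default tokenizer with no stop words, rather than the `cab_tokenizer` and `stopwords` used above, so the exact terms will differ from the real run.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for string_list.
string_list = ["the cat sat", "the cat ran", "a dog ran"]

# Assumption: default tokenizer and no stop words, unlike the original call.
count_vec = CountVectorizer(ngram_range=(1, 2))
cv_X = count_vec.fit_transform(string_list)

# Summing over axis=0 collapses the document dimension,
# leaving one total count per vocabulary column.
totals = np.asarray(cv_X.sum(axis=0)).ravel()

# vocabulary_ maps each term to its column index in cv_X.
term_freq = {term: int(totals[idx]) for term, idx in count_vec.vocabulary_.items()}

# Print terms from most to least frequent.
for term, count in sorted(term_freq.items(), key=lambda kv: -kv[1]):
    print(term, count)
```

Note that the default `token_pattern` only keeps tokens of two or more characters, so the single-letter word "a" never enters the vocabulary here. Tokenizer settings like this are one place where two tools can end up with different term lists.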

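The other approach is to use WordCloud.process_text() (see the documentation), which returns a dict of word frequencies. For the stop words, I reused the list I had obtained earlier from a TfidfVectorizer via tfidf_vec.get_stop_words().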

text_freq = WordCloud(stopwords=stopwords, collocations=True).process_text(text)

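Since I pass in the same stop words from the TfidfVectorizer, I expected the two approaches to behave the same. However, the features/terms I get are different (the length of the dictionary is smaller than {{1} }).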
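
One way to see exactly where the two results diverge is to compare their key sets and counts. The sketch below uses small hypothetical dicts in place of the real outputs: `term_freq` is an assumed term-to-count dict built from the CountVectorizer result, and `text_freq` stands for the dict returned by `WordCloud.process_text()`.

```python
# Hypothetical stand-ins for the two real results.
term_freq = {"alpha": 2, "beta": 1, "gamma": 4}   # from CountVectorizer
text_freq = {"alpha": 2, "gamma": 3}              # from WordCloud.process_text()

# Terms that appear in only one of the two results.
only_in_cv = sorted(set(term_freq) - set(text_freq))
only_in_wc = sorted(set(text_freq) - set(term_freq))

# Terms present in both results but with different counts.
shared = set(term_freq) & set(text_freq)
count_mismatch = {
    term: (term_freq[term], text_freq[term])
    for term in sorted(shared)
    if term_freq[term] != text_freq[term]
}

print("only in CountVectorizer:", only_in_cv)
print("only in WordCloud:", only_in_wc)
print("different counts (cv, wc):", count_mismatch)
```

Looking at the terms that fall into each bucket should show whether the gap comes from tokenization, n-gram handling, or stop-word filtering.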

So I am wondering: what is the difference between using one versus the other? Is one more accurate than the other?