Question

I'm trying to use the CountVectorizer module from scikit-learn. From what I've read, it seems it can be used on a list of sentences, for example:

['This is the first document.', 'This is the second second document.', 'And the third document.', 'Is this the first document?']

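However, is there a way to vectorize a collection of words that is already in list form, such as [['this', 'is', 'text', 'document', 'to', 'analyze'], ['and', 'this', 'is', 'the', 'second'], ['and', 'this', 'and', 'that', 'is', 'third']]?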

I'm trying to convert each list into a sentence using ' '.join(wordList), but I get this error:

TypeError: sequence item 13329: expected string or Unicode, generator found

when I try to run:

vectorizer = CountVectorizer(min_df=50)
ratings = vectorizer.fit_transform([' '.join(wordList)])

Thanks!

Answer 1

I think you need to do this:

counts = vectorizer.fit_transform(wordList)  # sparse matrix with columns corresponding to words
words = vectorizer.get_feature_names_out()  # array with words corresponding to columns (get_feature_names() in scikit-learn < 1.0)
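
For what it's worth, here is a minimal sketch of that call on data shaped like the question's. It assumes wordList is a list of token lists, so each inner list is joined into its own string first; calling ' '.join on the outer list is what raises the TypeError, because its items are not strings. The sample data and the names word_lists and documents are made up for illustration, and min_df is left at its default because min_df=50 cannot be satisfied by a three-document corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data shaped like the list of word lists in the question.
word_lists = [
    ["this", "is", "text", "document", "to", "analyze"],
    ["and", "this", "is", "the", "second"],
    ["and", "this", "and", "that", "is", "third"],
]

# Join each inner list separately so that every document becomes one string.
documents = [" ".join(tokens) for tokens in word_lists]

# Default min_df; a value like 50 only makes sense on a much larger corpus.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(documents)  # sparse document-term matrix

# vocabulary_ maps each word to its column index in counts.
and_column = vectorizer.vocabulary_["and"]
and_count_in_third_doc = counts[2, and_column]
```

Note that the default tokenizer re-splits the joined strings and drops single-character tokens, so this only round-trips cleanly when every word has at least two characters.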

Finally, to get the words of one sample back, e.g. ['this', 'is', 'text', 'document', 'to', 'analyze']:

sample_idx = 1
sample_words = [words[i] for i, count in 
                enumerate(counts.toarray()[sample_idx]) if count > 0]
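
If the documents are already tokenized, another option that may fit is to pass a callable as the analyzer argument, so CountVectorizer uses the token lists as they are and no join is needed. This is only a sketch under that assumption: identity_analyzer, word_lists, and index_to_word are illustrative names, and the data is made up to match the question.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents shaped like the question's input.
word_lists = [
    ["this", "is", "text", "document", "to", "analyze"],
    ["and", "this", "is", "the", "second"],
    ["and", "this", "and", "that", "is", "third"],
]


def identity_analyzer(tokens):
    """Return the tokens unchanged, since each document is already a word list."""
    return tokens


# A callable analyzer replaces the built-in preprocessing and tokenization.
vectorizer = CountVectorizer(analyzer=identity_analyzer)
counts = vectorizer.fit_transform(word_lists)

# Invert vocabulary_ (word -> column) to look up the word for each column.
index_to_word = {idx: word for word, idx in vectorizer.vocabulary_.items()}

# Recover the distinct words of the first sample from its row of counts.
sample_idx = 0
row = counts[sample_idx].toarray().ravel()
sample_words = [index_to_word[i] for i, count in enumerate(row) if count > 0]
```

The recovered words follow the column order of the matrix rather than their order in the original list, and repeated words appear only once.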

How do I vectorize a list of words in Python?
