Question

I want to create a bag-of-words representation of a text file as a vector (via .toarray()). I am using this code:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(input="file")
f = open('D:\\test\\45.txt')
bag_of_words = vectorizer.fit_transform([f])
print(bag_of_words)

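I want to compare the result against a fixed CountVectorizer vocabulary. I have a text file that I have already tokenized, and I want to use it as the vocabulary. How can I do that?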

Answer 1

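Provided the tokenization was done by inserting spaces between the individual tokens, creating a vocabulary from the text is straightforward: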

with open('foo.txt') as f:
    text = f.read()  # text is a single string
tokens = text.split()  # split the string into individual tokens
vocab = list(set(tokens))  # set() removes duplicates from the token list
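
To connect this back to the question, the resulting list can be handed to CountVectorizer through its `vocabulary` parameter. The following is only a minimal sketch: the two strings are hypothetical stand-ins for the contents of the vocabulary file and the document file, and the tokens are sorted (an addition not in the snippet above) so that the column order is reproducible.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the contents of the two files.
vocab_text = "apple banana cherry"            # the tokenized vocabulary file
document_text = "banana apple banana durian"  # the document to vectorize

# Unique tokens, sorted so each column index is predictable.
vocab = sorted(set(vocab_text.split()))

# With a fixed vocabulary, nothing new is learned from the document;
# words outside the vocabulary (here "durian") are ignored.
vectorizer = CountVectorizer(vocabulary=vocab)
counts = vectorizer.fit_transform([document_text]).toarray()

print(vocab)   # column labels
print(counts)  # one row per document, one column per vocabulary entry
```

Note that CountVectorizer lowercases its input and drops single-character tokens by default, so vocabulary entries containing uppercase letters or consisting of a single character would never be matched unless `lowercase` or `token_pattern` is changed.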
