vector.toarray() returns all 0s with TfidfVectorizer

Time: 2019-02-07 19:44:11

Tags: nlp tf-idf tfidfvectorizer

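I downloaded a text file from the internet, and I am trying to clean it and create TF-IDF vectors from it.

Below is the code. Every number in the array printed by the final statement is 0. I do not know whether that is correct.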

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

file = 'C:/Study/Machine Learning/Dataset/NLP_Data_s.txt'
with open(file, 'rt') as text:
    words = text.read()
lower = words.lower()  # convert all words to lower case
tokens = word_tokenize(lower)  # tokenize words
table = str.maketrans("", "", string.punctuation)  # translation table that deletes punctuation
remove_punct = [w.translate(table) for w in tokens]  # remove punctuation from tokens
stop_words = set(stopwords.words('english'))
remove_stop = [word for word in remove_punct if word not in stop_words]  # remove stop words
porter = PorterStemmer()
Stemmed = [porter.stem(word) for word in remove_stop]  # one stemmed token per list element
vectorizer = TfidfVectorizer()
vectorizer.fit(Stemmed)  # note: every list element is treated as a separate document
print(vectorizer.get_feature_names())  # on scikit-learn 1.2+ use get_feature_names_out()
print(vectorizer.vocabulary_)
print(vectorizer.idf_)
vector = vectorizer.transform(Stemmed)
print(vector.shape)
print(type(vector))
print(vector.toarray())
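
One possible explanation, offered as a sketch rather than a confirmed diagnosis: TfidfVectorizer treats each element of the list it receives as a separate document. Since `Stemmed` holds one token per element, every row of the result describes a one-word document and has at most one non-zero entry. For a large matrix, NumPy prints only the corners, so the output can look like all zeros even when it is not. The short token list below is a hypothetical stand-in for the real data, and joining the tokens back into a single string is only one way to build a document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for the question's `Stemmed` list: one stemmed token per element.
stemmed = ["machin", "learn", "text", "clean", "machin", "vector"]

# Case 1: pass the token list directly, as in the question.
# Each element is treated as its own document, so this is six one-word documents.
per_token = TfidfVectorizer().fit_transform(stemmed)
print(per_token.shape)  # (number of tokens, vocabulary size)
print(per_token.nnz)    # count of non-zero entries stored in the sparse matrix

# Case 2: join the tokens into one string, so the whole text is a single document.
per_document = TfidfVectorizer().fit_transform([" ".join(stemmed)])
print(per_document.shape)
print(per_document.toarray())  # one row with a weight for every vocabulary term
```

Checking `vector.nnz` on the matrix from the question would show whether it really contains only zeros or whether the non-zero entries are just hidden by the truncated printout. Tokens that become empty strings after punctuation removal would also produce rows that are entirely zero.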

0 answers:

No answers yet