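I want to read a .txt file and then apply some preprocessing to it with scikit-learn, for example vectorizing it (i.e., building a vector representation from the text). I tried the script below on some text, but I can't apply this preprocessing to a .txt file selected from my desktop.
This is what I did: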
# -*- coding: utf-8 -*-
from Tkinter import Tk
from tkFileDialog import askopenfilename
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=1)
Tk().withdraw()  # hide the empty root window so only the file dialog appears
opinion_filename = askopenfilename()
if opinion_filename:
    # iterating over the file object yields one document per line
    with open(opinion_filename) as opinion_file:
        X = vectorizer.fit_transform(opinion_file)
    print("This is the name of the filename:", opinion_filename)
    print("This is the vectorized filename ", X)
else:
    # user might select no file and hit cancel the file open dialog
    pass
This is the output:
('This is the name of the filename:', '/Users/user/Desktop/opinion_prueba.txt')
('This is the vectorized filename ', <1x22 sparse matrix of type '<type 'numpy.int64'>'
with 22 stored elements in Compressed Sparse Row format>)
I want to get back the vector representation of the .txt file.
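Answer 0 (score: 0)
askopenfilename returns the file name (as a string), not a file object that data can be read from. So you need to update your code to open a file object from the file name and then pass that file object to vectorizer.fit_transform.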
opinion_filename = askopenfilename()
if opinion_filename:
    with open(opinion_filename) as opinion_file:
        X = vectorizer.fit_transform(opinion_file)
    # rest of the code ...
else:
    # user might select no file and hit cancel the file open dialog
    pass
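
Printing X directly only shows the sparse matrix summary seen in the question's output. To look at the actual counts, one option is to convert the result to a dense array. The following is a minimal sketch, not the only way to do it: the helper name vectorize_file and the two-line sample text are hypothetical, and a temporary file stands in for the file picked in the dialog.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer


def vectorize_file(path):
    """Return (feature_names, dense_counts) for the text file at `path`.

    Each line of the file is treated as one document, because iterating
    over a file object yields its lines.
    """
    vectorizer = CountVectorizer(min_df=1)
    with open(path) as text_file:
        X = vectorizer.fit_transform(text_file)
    # vocabulary_ maps each term to its column index in X
    vocab = vectorizer.vocabulary_
    feature_names = sorted(vocab, key=vocab.get)
    # toarray() turns the sparse matrix into a dense NumPy array
    return feature_names, X.toarray()


# Hypothetical sample file standing in for the one chosen in the dialog
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("good movie good plot\nbad movie\n")
    sample_path = tmp.name

names, counts = vectorize_file(sample_path)
os.remove(sample_path)

print(names)   # terms, in column order
print(counts)  # one row per line of the file, one column per term
```

In the script above, the same idea would be print(X.toarray()) after the fit_transform call. For a very large vocabulary the dense array can use a lot of memory, so keeping X sparse is usually preferable for further processing.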