Question

I have a directory of .txt files (documents). First I load the documents and strip some parentheses and quotes, so the documents look like this:

document1:
is a scientific discipline that explores the construction and study of algorithms that can learn from data Such algorithms operate by building a model

document2:
Machine learning can be considered a subfield of computer science and statistics It has strong ties to artificial intelligence and optimization which deliver methods

So I am loading the files from the directory like this:

preprocessDocuments = [[' '.join(x) for x in sample[:-1]] for sample in load(directory)]

documents = ''.join(i for i in ''.join(str(v) for v in preprocessDocuments)
                    if i not in "',()")

Then I try to vectorize document1 and document2 in order to create a training matrix like this:

from sklearn.feature_extraction.text import HashingVectorizer
vectorizer = HashingVectorizer(analyzer='word')
X = HashingVectorizer.fit_transform(documents)
X.toarray()

And this is the output:

    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

How can I create a vector representation? I thought the loaded files were held in documents, but it seems they cannot be fitted.

Answer 1

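What is the content of documents? It looks like it should be a list of filenames or a list of strings containing the tokens. Also, you should call fit_transform on the object rather than as if it were a static method, i.e. vectorizer.fit_transform(documents).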
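The difference between calling a method on the class and on an instance can be sketched with a toy class. ToyVectorizer below is a hypothetical stand-in written only for illustration; it is not scikit-learn's class, and its fit_transform merely counts tokens.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a vectorizer; not scikit-learn's class."""

    def fit_transform(self, raw_documents):
        # Return the number of whitespace-separated tokens per document.
        return [len(doc.split()) for doc in raw_documents]


documents = ["this is a test", "another test"]

# Called on the class: `documents` is bound to `self`,
# so the `raw_documents` argument is missing and Python raises TypeError.
error = None
try:
    ToyVectorizer.fit_transform(documents)
except TypeError as exc:
    error = str(exc)
    print("class call failed:", error)

# Called on an instance: `self` is supplied automatically.
counts = ToyVectorizer().fit_transform(documents)
print(counts)
```

The same binding rule applies to any ordinary Python method, which is why the call should go through the vectorizer object.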

For example, this works for me:

from sklearn.feature_extraction.text import HashingVectorizer
documents=['this is a test', 'another test']
vectorizer = HashingVectorizer(analyzer='word')
X = vectorizer.fit_transform(documents)
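To get from the loading code in the question to a list like the one above, one option is to clean each document separately instead of joining everything into a single string. The following is a minimal sketch: load(directory) is not shown in the question, so samples is a made-up stand-in, and the assumption that the last element of each sample should be dropped is taken from the sample[:-1] slice.

```python
# Hypothetical stand-in for load(directory): each sample is a list of tokens
# whose last element is dropped, mirroring sample[:-1] in the question.
samples = [
    ["is", "a", "scientific", "discipline", "label1"],
    ["Machine", "learning", "can", "be", "label2"],
]


def clean(text):
    """Remove the quote, comma and parenthesis characters stripped in the question."""
    return "".join(ch for ch in text if ch not in "',()")


# One cleaned string per document, kept as separate list entries
# rather than concatenated into one long string.
documents = [clean(" ".join(sample[:-1])) for sample in samples]

print(type(documents).__name__, len(documents))
print(documents)
```

A list built this way can be passed to vectorizer.fit_transform(documents) exactly as in the example above.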
