Question

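There is a very good introduction on how to use sklearn for text analytics.

However, in that tutorial they use sklearn's datasets with their 'Bunch' objects, which are not explained in detail, so I am having trouble getting my data into the form required to use sklearn methods on it. I want to use CountVectorizer() on my text data for further processing, but calling CountVectorizer.fit_transform(my_string_array) always raises an error:

AttributeError: 'list' object has no attribute 'lower'

So far I have tried initializing the following numpy array types and loading my strings into them, but none of them worked:

np.chararray(shape)
np.empty(shape, dtype=str/obj)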

Answer 1

A simplified example:


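from sklearn.feature_extraction.text import CountVectorizer

docs = ['This is the first document', 'This is the second document']
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(docs)

docs should be a collection of strings, i.e. a list, a numpy array, etc.

If the text is already tokenized, you need to tell CountVectorizer that it does not need to split the strings, for example by passing it a custom tokenizer or analyzer.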
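As a minimal sketch of the pre-tokenized case: one way to stop CountVectorizer from splitting the strings is to pass a callable as analyzer, which then receives each document as-is. The helper name identity_analyzer and the sample data are assumptions made for illustration; they are not taken from the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer


def identity_analyzer(tokens):
    """Return the token list unchanged (hypothetical helper for this sketch)."""
    return tokens


# Each document is already a list of tokens (illustrative data).
tokenized_docs = [["yo", "dude"], ["how", "are", "you"]]

# A callable analyzer replaces the built-in preprocessing and tokenization,
# so the tokens are counted exactly as given (no lowercasing, no splitting).
vectorizer = CountVectorizer(analyzer=identity_analyzer)
X = vectorizer.fit_transform(tokenized_docs)

vocabulary = sorted(vectorizer.vocabulary_)
counts = X.toarray().tolist()
print(vocabulary)
print(counts)
```

Because the analyzer is a callable, any casing or filtering has to be done on the tokens beforehand.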

Answer 2

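CountVectorizer needs a sequence or list of strings to process, as mentioned here:

input : string {'filename', 'file', 'content'}

If 'filename', the sequence passed as an argument to fit is expected to be a list of filenames that need to be read to fetch the raw content to analyze.

If 'file', the sequence items must have a 'read' method (file-like object) that is called to fetch the bytes in memory.

Otherwise the input is expected to be a sequence of string or bytes items, which are analyzed directly.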
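To make the difference between the input modes concrete, here is a small sketch that vectorizes the same two texts once from files (input="filename") and once from in-memory strings (the default, input="content"). The file names and the texts are made up for illustration.

```python
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

texts = ["This is the first document", "This is the second document"]

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for name, text in zip(["a.txt", "b.txt"], texts):
        path = os.path.join(tmp_dir, name)
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(text)
        paths.append(path)

    # input="filename": the vectorizer receives paths and reads the files itself.
    from_files = CountVectorizer(input="filename").fit_transform(paths)

# input="content" (the default): the vectorizer receives the raw strings.
from_strings = CountVectorizer().fit_transform(texts)

same_counts = from_files.toarray().tolist() == from_strings.toarray().tolist()
print(same_counts)
```

In both modes each item of the outer sequence stands for exactly one document.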

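You are supplying [['string1', 'string2', ....], ['string1', 'string2', ....]]. The outer container is a sequence, so that requirement is met.

CountVectorizer() then iterates over the elements of the list you supplied.

It expects an object of type string and calls lower() on it (to lowercase the string). What it gets is ['string1', 'string2', ....], which is a list and obviously has no lower() method. Hence the error.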
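The failure can be reproduced without sklearn. The function below is a deliberately simplified stand-in for the lowercasing step, not sklearn's actual code; lowercase_documents and the sample data are names invented for this sketch.

```python
def lowercase_documents(documents):
    """Simplified stand-in for the lowercasing step; not sklearn's real code."""
    return [document.lower() for document in documents]


flat = ["Yo dude", "How are you"]                  # one string per document
nested = [["yo", "dude"], ["how", "are", "you"]]   # one list per document

lowered = lowercase_documents(flat)
print(lowered)

try:
    lowercase_documents(nested)
    error_message = None
except AttributeError as exc:
    # Each element is a list, and lists have no lower() method.
    error_message = str(exc)
print(error_message)
```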

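Solution: In my opinion, it will not change the result if, instead of a list of lists of strings, you pass just a single list of strings to CountVectorizer().

Turn each inner list of strings (each per-document list you are using) into a single string by doing the following: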

data = [" ".join(x) for x in data]

where data is your string data, containing lists of strings.

Suppose your data is:

data = [['yo', 'dude'],['how','are', 'you']]
data = [" ".join(x) for x in data]

Output:

['yo dude', 'how are you']

This can now be passed to CountVectorizer without any error.
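Putting the pieces together, a short end-to-end sketch with the same example data might look like this. The variable names joined, vocabulary and counts are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

data = [["yo", "dude"], ["how", "are", "you"]]

# Collapse each per-document token list into one space-separated string.
joined = [" ".join(tokens) for tokens in data]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(joined)

vocabulary = sorted(vectorizer.vocabulary_)
counts = X.toarray().tolist()
print(vocabulary)
print(counts)
```

Note that the joined strings are tokenized again with CountVectorizer's defaults, which lowercase the text and keep only tokens of two or more alphanumeric characters, so the resulting vocabulary can differ from the original token lists.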

How to use a custom text data format with sklearn's CountVectorizer()?
