Question

I am using scikit-learn for text processing, but my CountVectorizer is not giving the output I expect.

My CSV file looks like this:

"Text";"label"
"Here is sentence 1";"label1"
"I am sentence two";"label2"

and so on.

I want to start with Bag of Words to understand how SVM works in Python.

import pandas as pd
from sklearn import svm
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv(open('myfile.csv'),sep=';')

target = data["label"]
del data["label"]

# Creating Bag of Words
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data)
X_train_counts.shape 
count_vect.vocabulary_.get(u'algorithm')

When I run

print(X_train_counts.shape)

I see the output (1, 1), even though I have 1048 rows of sentences. Then I look at the output of

count_vect.vocabulary_.get(u'algorithm')

which is None.

Can you tell me what I am doing wrong? I am following this tutorial.

Answer 1

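The problem is in count_vect.fit_transform(data). That function expects an iterable that yields strings. A DataFrame is iterable, but it yields the wrong strings, which a simple example shows:

for x in data:
    print(x)
# Text

Only the column name is printed: iterating over a DataFrame yields its column labels, not the values in data["Text"]. CountVectorizer therefore sees a single document, "Text", which explains the (1, 1) shape. Pass the text column itself instead:

X_train_counts = count_vect.fit_transform(data["Text"])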
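A minimal self-contained sketch of the difference between fitting on the DataFrame and fitting on its text column. It assumes the two sample rows from the question stand in for the full CSV; the csv_text string and the wrong variable are illustrative names, not part of the original post.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical in-memory stand-in for myfile.csv (two sample rows only).
csv_text = (
    '"Text";"label"\n'
    '"Here is sentence 1";"label1"\n'
    '"I am sentence two";"label2"\n'
)
data = pd.read_csv(io.StringIO(csv_text), sep=";")
target = data["label"]

# Fitting on the DataFrame: iteration yields column labels, so the
# vectorizer sees one "document", the string "Text".
wrong = CountVectorizer().fit_transform(data.drop(columns="label"))
print(wrong.shape)  # (1, 1)

# Fitting on the text column: one document per row.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data["Text"])
print(X_train_counts.shape)  # (2, 5)
print(sorted(count_vect.vocabulary_))
```

With the default tokenizer, single-character tokens such as "I" and "1" are dropped, so these two sentences produce five vocabulary entries. A lookup like count_vect.vocabulary_.get("algorithm") still returns None here, simply because that word does not occur in the sample rows.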
