Question

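I want to predict the accuracy of a sentiment analysis model using logistic regression, but I get an error: bad input shape (edited based on input).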

DataFrame:

df
sentence                | polarity_label
new release!            | positive
buy                     | neutral
least good-looking      | negative

Code:

from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, ENGLISH_STOP_WORDS
# Define the set of stop words
my_stop_words = ENGLISH_STOP_WORDS
vect = CountVectorizer(max_features=5000,stop_words=my_stop_words)
vect.fit(df.sentence)
X = vect.transform(df.sentence)
y = df.polarity_label
encoder = OneHotEncoder()
encoder.fit_transform(y)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=123)
LogisticRegression(penalty='l2',C=1.0)

log_reg = LogisticRegression().fit(X_train, y_train)

Error message:

ValueError: Expected 2D array, got 1D array instead:
array=['Neutral' 'Positive' 'Positive' ... 'Neutral' 'Neutral' 'Neutral'].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

How can I fix this?

Answer 1

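I think you need to convert your y labels to one-hot encoding. Right now your label vector probably looks like [0,1,0,0,1,0], but for logistic regression you need to convert it to the form [[0,1],[1,0],[0,1],[0,1]], because logistic regression computes a probability for every class.

You can use sklearn's OneHotEncoder: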

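from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# OneHotEncoder expects a 2D array, so reshape the 1D label column first
y_onehot = encoder.fit_transform(y.to_numpy().reshape(-1, 1))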
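
A note on shapes, as a minimal sketch rather than a definitive fix: the ValueError in the question is what OneHotEncoder raises when it is given a 1D array, which is why the message suggests reshape(-1, 1). scikit-learn's LogisticRegression.fit, on the other hand, takes a 1D array of class labels (strings are fine), so the one-hot matrix would not be passed to it directly. The sentences and labels below are toy data made up for illustration, not the asker's data set.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Toy data (an assumption for illustration only)
sentences = [
    "new release!", "great new release", "buy", "buy now",
    "least good-looking", "bad looking release",
]
labels = np.array(
    ["positive", "positive", "neutral", "neutral", "negative", "negative"]
)

# OneHotEncoder needs a 2D array of shape (n_samples, n_features)
onehot = OneHotEncoder().fit_transform(labels.reshape(-1, 1))
print(onehot.shape)  # (6, 3): six samples, three classes

# LogisticRegression.fit takes the 1D label array as-is
X = CountVectorizer().fit_transform(sentences)
log_reg = LogisticRegression().fit(X, labels)
print(log_reg.classes_)
print(log_reg.score(X, labels))  # training accuracy on the toy data
```

With a pandas column such as df.polarity_label, the same reshape can be done via y.to_numpy().reshape(-1, 1).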

Answer 2

Adjust your code like this, for example:

y = df.polarity_label

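Currently you are trying to transform y into a vector using the CountVectorizer, which was fitted on the sentence data.

So the CountVectorizer has the following vocabulary (you can get it with vect.get_feature_names(), or vect.get_feature_names_out() in scikit-learn 1.0 and later):

['buy', 'good', 'looking', 'new', 'release']

and it will transform any text containing these words into a vector.

But when you use it on y, which contains only the words positive, neutral, negative, it does not find any "known" words, so your y is empty.

If you inspect y after the transformation, you can also see that it is empty: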

<3x5 sparse matrix of type '<class 'numpy.int64'>'
    with 0 stored elements in Compressed Sparse Row format>
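
To see this for yourself, here is a small sketch that rebuilds the three example rows from the question. It assumes stop_words="english" as a stand-in for passing ENGLISH_STOP_WORDS, and reads the fitted vocabulary_ attribute so it does not depend on which get_feature_names variant your scikit-learn version provides.

```python
from sklearn.feature_extraction.text import CountVectorizer

# The three example rows from the question
sentences = ["new release!", "buy", "least good-looking"]
labels = ["positive", "neutral", "negative"]

# stop_words="english" is assumed here in place of ENGLISH_STOP_WORDS
vect = CountVectorizer(max_features=5000, stop_words="english")
vect.fit(sentences)

# Vocabulary learned from the sentences only
print(sorted(vect.vocabulary_))

# None of the label words is in that vocabulary, so every row is all zeros
y_vectorized = vect.transform(labels)
print(y_vectorized.shape, y_vectorized.nnz)
```

The nnz attribute is the number of stored (non-zero) entries in the sparse matrix, the same count that appears in the repr above.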

Bad input shape in sentiment analysis logistic regression
