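I am working with training and test datasets of tweets, which have been combined into one frame (combi = train.append(test, ignore_index=True)).
The training CSV has manually labeled sentiment: -1, 0, and 1 (negative, neutral, and positive, respectively), while the test set has no labels.
I want the code to output the F1 score using logistic regression, but it fails at the call f1_score(yvalid, prediction_int):
My code is as follows: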
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
bow = bow_vectorizer.fit_transform(combi['tidy_tweet'])
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_df=0.90, min_df=2, max_features=1000, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(combi['tidy_tweet'])
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
train_bow = bow[:1300,:]
test_bow = bow[1300:,:]
xtrain_bow, xvalid_bow, ytrain, yvalid = train_test_split(train_bow, train['label'], random_state=42, test_size=0.3)
lreg = LogisticRegression()
lreg.fit(xtrain_bow, ytrain) # training the model
prediction = lreg.predict_proba(xvalid_bow)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)  # built-in int; the np.int alias was removed in NumPy 1.24
f1_score(yvalid, prediction_int)
Answer 0 (score: 0)
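If you read the relevant documentation, you will see that the default value of the average parameter of f1_score is 'binary'. Since you do not specify it, the default is used, but that value is not valid for your multiclass classification (agreed, this is arguably a poor design choice).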
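As the error message suggests, you should explicitly choose and pass one of the other available values listed in the documentation. Here is the example from the documentation, using dummy multiclass data: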
from sklearn.metrics import f1_score
# dummy multi-class data, similar to yours:
y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
f1_score(y_true, y_pred, average='macro')
# 0.26666666666666666
f1_score(y_true, y_pred, average='micro')
# 0.33333333333333331
f1_score(y_true, y_pred, average='weighted')
# 0.26666666666666666
f1_score(y_true, y_pred)
# ValueError: Target is multiclass but average='binary'. Please choose another average setting.
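
Applied to a setup like the one in the question, a minimal sketch might look as follows. Two things here are assumptions rather than part of the original code: the features and labels are synthetic stand-ins generated with make_classification (the tweet data is not available), and lreg.predict replaces the thresholding of prediction[:, 1]. That threshold is a binary-classification technique; with labels -1, 0 and 1, column 1 of predict_proba is the probability of class 0 only, so thresholding it collapses the predictions to 0/1 instead of the three sentiment classes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bag-of-words features and the -1/0/1 labels
# (an assumption; the original tweet data is not available here).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
y = y - 1  # map classes 0, 1, 2 to -1, 0, 1

xtrain, xvalid, ytrain, yvalid = train_test_split(
    X, y, random_state=42, test_size=0.3)

lreg = LogisticRegression(max_iter=1000)
lreg.fit(xtrain, ytrain)

# predict() returns the most probable class label for each row, so the
# predictions stay in {-1, 0, 1} rather than being reduced to 0/1.
prediction = lreg.predict(xvalid)

# An explicit average is required for multiclass targets.
macro_f1 = f1_score(yvalid, prediction, average='macro')
# average=None returns one F1 score per class instead of a single number.
per_class_f1 = f1_score(yvalid, prediction, average=None)

print(macro_f1)
print(per_class_f1)
```

Which average to pick depends on the goal: 'macro' is the unweighted mean of the per-class scores, 'weighted' weights each class by its support, and 'micro' pools the counts over all classes before computing the score.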