I have a dataset like this:
Description  attributes.occasion.0  attributes.occasion.1  attributes.occasion.2  attributes.occasion.3  attributes.occasion.4
descr01      Chanukah               Christmas              Housewarming           Just Because           Thank You
descr02      Anniversary            Birthday               Christmas              Graduation             Mother's Day
descr03      Chanukah               Christmas              Housewarming           Just Because           Thank You
descr04      Baby Shower            Birthday               Cinco de Mayo          Gametime               Just Because
descr05      Anniversary            Birthday               Christmas              Graduation             Mother's Day
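descr01 => a description of the occasion (I have used short names here; the actual data contains the full-text description), and so on.
In the dataset above, I have one independent variable, the text description, and five dependent categorical variables (attributes.occasion.0 to attributes.occasion.4).
I tried a random forest classifier, which accepts multiple dependent variables as targets.
A sample from the dataset: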
attributes.occasion.0  attributes.occasion.1  attributes.occasion.2  attributes.occasion.3  attributes.occasion.4
Back to School         Birthday               School Events          NaN                    NaN
Description:
Cafepress Personalized 5th Birthday Cowgirl Kids Light T-Shirt:100 percent cotton Youth T-Shirt by Hanes,Preshrunk, durable and guaranteed
Here is the code I tried:
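## Imports
import joblib
import scipy.sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

## Split the dataset
# cols_to_retain was not shown in the original snippet; it is assumed here to
# hold only the text column. X is kept as a DataFrame so it can be indexed by name.
cols_to_retain = ['Description']
target_cols = ['attributes.occasion.0', 'attributes.occasion.1', 'attributes.occasion.2',
               'attributes.occasion.3', 'attributes.occasion.4']
X_train, X_test, y_train, y_test = train_test_split(
    df[cols_to_retain], df[target_cols], test_size=0.3, random_state=0)

## Apply the model
# alternate_sign=False stands in for non_negative=True, which newer
# scikit-learn releases no longer accept.
tfidf = Pipeline([('vect', HashingVectorizer(ngram_range=(1, 7), alternate_sign=False)),
                  ('tfidf', TfidfTransformer()),
                  ])

def feature_combine(dataset):
    # Fit the TF-IDF pipeline on each retained text column and stack the results.
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.fit_transform(dataset[col].astype(str)))
    joblib.dump(tfidf, "tfidf.sav")
    Xspall = scipy.sparse.hstack(Xall)
    return Xspall

def test_Data_text_transform_and_combine(dataset):
    # Transform the test data with the already fitted pipeline (no refitting).
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.transform(dataset[col].astype(str)))
    Xspall = scipy.sparse.hstack(Xall)
    return Xspall

text_clf = RandomForestClassifier()
_ = text_clf.fit(feature_combine(X_train), y_train)
RF_predicted = text_clf.predict(test_Data_text_transform_and_combine(X_test))
# Accuracy in percent, computed separately for each target column
(y_test == RF_predicted).mean() * 100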
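When I compute the accuracy, I get the output below. But I don't know how to interpret this result, or how to plot a confusion matrix and other performance metrics.
Output:
Accuracy for each dependent variable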
attributes.occasion.0 87.517672
attributes.occasion.1 96.050306
attributes.occasion.2 98.362394
attributes.occasion.3 99.184142
attributes.occasion.4 99.564090
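For context on reading these numbers: each figure appears to be the share of test rows in which that single column was predicted correctly, which is a different quantity from getting every column of a row right. A minimal sketch of the difference, using made-up labels (not the real data) and hypothetical variable names, assuming pandas and scikit-learn are installed:

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels, invented for illustration only.
cols = ["attributes.occasion.0", "attributes.occasion.1"]
y_true = pd.DataFrame(
    [["Birthday", "Christmas"],
     ["Chanukah", "Christmas"],
     ["Birthday", "Graduation"],
     ["Chanukah", "Christmas"]],
    columns=cols,
)
y_pred = pd.DataFrame(
    [["Birthday", "Christmas"],
     ["Birthday", "Christmas"],
     ["Birthday", "Christmas"],
     ["Chanukah", "Christmas"]],
    columns=cols,
)

# Per-column accuracy: fraction of rows where that one column is correct.
per_column_acc = {c: accuracy_score(y_true[c], y_pred[c]) for c in cols}

# Exact-match accuracy: a row counts only if all of its columns are correct.
exact_match_acc = (y_true.values == y_pred.values).all(axis=1).mean()

# One confusion matrix per column (rows = true label, columns = predicted label).
labels0 = sorted(y_true[cols[0]].unique())
cm0 = confusion_matrix(y_true[cols[0]], y_pred[cols[0]], labels=labels0)

print(per_column_acc)
print(exact_match_acc)
print(labels0)
print(cm0)
```

In this toy case both columns score 0.75 individually, while only half of the rows are fully correct, so per-column figures can look better than the row-level result. The same per-column loop could also be fed to sklearn.metrics.classification_report for precision and recall.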
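Could someone tell me how to handle a multi-label problem and how to evaluate the model's performance? What approaches are possible in this case? I am using the Python scikit-learn library.
Thanks, NIRANJAN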
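One possible framing, sketched here under the assumption that the position of an occasion (occasion.0 versus occasion.3) carries no meaning and each row is simply a set of occasions: collapse the columns into one label set per row, binarize it with MultiLabelBinarizer, and score with multi-label metrics. The toy DataFrame, the variable names, and the use of TfidfVectorizer are illustrative choices, not the setup from the code above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, hamming_loss
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data, invented for illustration; NaN marks an unused occasion slot.
df = pd.DataFrame({
    "Description": [
        "Personalized 5th birthday cowgirl kids t-shirt",
        "Christmas tree ornament gift set",
        "Birthday balloon and party banner kit",
        "Holiday Christmas card with birthday coupon",
    ],
    "attributes.occasion.0": ["Back to School", "Christmas", "Birthday", "Birthday"],
    "attributes.occasion.1": ["Birthday", np.nan, np.nan, "Christmas"],
})
label_cols = ["attributes.occasion.0", "attributes.occasion.1"]

# Collapse the occasion columns into one list of labels per row, dropping NaN.
label_sets = [[v for v in row if pd.notna(v)]
              for row in df[label_cols].values.tolist()]

# Binary indicator matrix: one column per distinct occasion.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)

# RandomForestClassifier accepts a 2-D indicator matrix as a multi-label target.
X = TfidfVectorizer().fit_transform(df["Description"])
clf = RandomForestClassifier(n_estimators=20, random_state=0)
clf.fit(X, Y)
Y_pred = clf.predict(X)  # scored on the training rows only to show the calls

# Hamming loss: fraction of individual label decisions that are wrong.
h_loss = hamming_loss(Y, Y_pred)
# Micro-averaged F1: pools true/false positives across all labels.
micro_f1 = f1_score(Y, Y_pred, average="micro", zero_division=0)

print(mlb.classes_)
print(Y)
print(h_loss, micro_f1)
```

On real data the metrics would be computed on a held-out split rather than on the training rows, and mlb.inverse_transform(Y_pred) turns the predicted indicator rows back into occasion names.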