Python scikit-learn: how to build a model for multi-class and multi-label data?

Asked: 2016-09-21 14:07:21

Tags: python machine-learning scikit-learn text-classification multilabel-classification

I have a dataset like this:

Description  attributes.occasion.0 attributes.occasion.1    attributes.occasion.2   attributes.occasion.3   attributes.occasion.4

 descr01        Chanukah                Christmas               Housewarming        Just Because                Thank You
 descr02        Anniversary             Birthday                Christmas           Graduation                  Mother's Day
 descr03        Chanukah                Christmas               Housewarming        Just Because                Thank You
 descr04        Baby Shower             Birthday                Cinco de Mayo       Gametime                    Just Because
 descr05        Anniversary             Birthday                Christmas           Graduation                  Mother's Day

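descr01 => a description of the item (I have used short names here; the real data contains the full text description), and so on.

In the dataset above, I have one independent variable, which holds the text description, and five dependent categorical variables (attributes.occasion.0 to attributes.occasion.4).

I tried a random forest classifier, which accepts multiple dependent variables as targets.

A sample from the dataset: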

    attributes.occasion.0   attributes.occasion.1   attributes.occasion.2   attributes.occasion.3   attributes.occasion.4
    Back to School                Birthday               School Events           NaN                      NaN


Description:

Cafepress Personalized 5th Birthday Cowgirl Kids Light T-Shirt:100 percent cotton Youth T-Shirt by Hanes,Preshrunk, durable and guaranteed
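Because each item carries an unordered set of occasions with NaN padding, the five positional columns can also be read as one multi-label target. A minimal sketch of that conversion, assuming scikit-learn's `MultiLabelBinarizer`; the toy rows below are illustrative and not taken from the real data:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy rows in the same shape as the question's label columns (values are illustrative).
labels = pd.DataFrame({
    "attributes.occasion.0": ["Back to School", "Anniversary"],
    "attributes.occasion.1": ["Birthday", "Birthday"],
    "attributes.occasion.2": ["School Events", None],
})

# Collapse the positional columns into one set of occasions per row, skipping missing values.
label_sets = [
    {value for value in row if pd.notna(value)}
    for row in labels.itertuples(index=False)
]

# One binary indicator column per distinct occasion.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(label_sets)

print(list(mlb.classes_))
print(Y)
```

With this representation the position of an occasion (column 0 versus column 3) no longer matters, which may or may not be what is wanted here.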

Here is the code I tried:

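import joblib
import scipy.sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# df is the DataFrame shown above, assumed to be loaded already.
cols_to_retain = ['Description']  # text columns used as features
label_cols = ['attributes.occasion.%d' % i for i in range(5)]

## Split the dataset
# The labels contain NaN where an item has fewer than five occasions. Recent
# scikit-learn versions reject NaN targets, so they are filled with a placeholder.
X_train, X_test, y_train, y_test = train_test_split(
    df[cols_to_retain], df[label_cols].fillna('None'),
    test_size=0.3, random_state=0)

## Apply the model
# non_negative=True has been removed from HashingVectorizer;
# alternate_sign=False keeps the hashed counts non-negative instead.
tfidf = Pipeline([
    ('vect', HashingVectorizer(ngram_range=(1, 7), alternate_sign=False)),
    ('tfidf', TfidfTransformer()),
])

def feature_combine(dataset):
    # Fit the TF-IDF pipeline on the training text and stack the columns.
    # Note: with more than one text column, the shared pipeline is refitted
    # on each column, so only the last fit is kept.
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.fit_transform(dataset[col].astype(str)))

    joblib.dump(tfidf, "tfidf.sav")
    return scipy.sparse.hstack(Xall).tocsr()

def test_Data_text_transform_and_combine(dataset):
    # Transform the test text with the already fitted pipeline (no refitting).
    Xall = []
    for col in cols_to_retain:
        if col != 'item_id' and col != 'last_updated_at':
            Xall.append(tfidf.transform(dataset[col].astype(str)))

    return scipy.sparse.hstack(Xall).tocsr()

text_clf = RandomForestClassifier()
_ = text_clf.fit(feature_combine(X_train), y_train)

RF_predicted = text_clf.predict(test_Data_text_transform_and_combine(X_test))

# Per-column accuracy in percent (one value per dependent variable)
(RF_predicted == y_test).mean() * 100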

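When I compute the accuracy, I get the output below. However, I don't know how to interpret this result, or how to plot a confusion matrix and compute other performance metrics.

Output: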

Accuracy for each dependent 

attributes.occasion.0    87.517672
attributes.occasion.1    96.050306
attributes.occasion.2    98.362394
attributes.occasion.3    99.184142
attributes.occasion.4    99.564090
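Each figure above is the share of test rows where that single column was predicted correctly. The higher values for the later columns may simply reflect that many items have fewer than five occasions, so the empty value dominates there; that is an assumption worth checking with `value_counts()`. A minimal sketch, on made-up predictions, of per-column accuracy, exact-match accuracy (all columns right at once) and a confusion matrix for one column:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

cols = ["attributes.occasion.0", "attributes.occasion.1"]

# Made-up true labels and predictions; "None" stands for a missing occasion.
y_true = pd.DataFrame(
    [["Birthday", "Christmas"],
     ["Anniversary", "Birthday"],
     ["Birthday", "None"],
     ["Anniversary", "Christmas"]],
    columns=cols,
)
y_pred = np.array(
    [["Birthday", "Christmas"],
     ["Birthday", "Birthday"],
     ["Birthday", "None"],
     ["Anniversary", "None"]],
    dtype=object,
)

# Accuracy of each output column taken on its own.
per_column = {
    col: accuracy_score(y_true[col], y_pred[:, i]) for i, col in enumerate(cols)
}

# Exact-match accuracy: a row counts only if every column is correct.
exact_match = (y_pred == y_true.to_numpy()).all(axis=1).mean()

# Confusion matrix for the first column (rows: true label, columns: predicted label).
labels0 = sorted(y_true[cols[0]].unique())
cm0 = confusion_matrix(y_true[cols[0]], y_pred[:, 0], labels=labels0)

print(per_column)
print(exact_match)
print(labels0)
print(cm0)
```

Exact-match accuracy is usually much lower than the per-column figures, so it is worth reporting both.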

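Could someone tell me how to handle a multi-label problem like this and how to evaluate the model's performance? What approaches are possible in this case? I am using the Python scikit-learn library.

Thanks, NIRANJAN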
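One possible approach, sketched here under assumptions rather than as a definitive answer, is to treat the occasions as a multi-label target: binarize the label sets, train one binary classifier per occasion with `OneVsRestClassifier`, and score the result with multi-label metrics. The texts and labels are invented for illustration, and `TfidfVectorizer` with `LogisticRegression` is swapped in for the hashing pipeline and random forest used in the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, hamming_loss, multilabel_confusion_matrix
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Invented training descriptions and their occasion sets.
train_texts = [
    "kids birthday cowgirl t-shirt",
    "christmas tree ornament gift",
    "birthday party balloons for kids",
    "christmas stocking and birthday card set",
    "graduation cap keepsake",
    "graduation and birthday photo frame",
]
train_occasions = [
    {"Birthday"}, {"Christmas"}, {"Birthday"},
    {"Christmas", "Birthday"}, {"Graduation"}, {"Graduation", "Birthday"},
]

# Invented held-out examples used only for scoring.
test_texts = ["birthday balloons", "christmas ornament"]
test_occasions = [{"Birthday"}, {"Christmas"}]

# Binary indicator matrix: one column per occasion.
mlb = MultiLabelBinarizer()
Y_train = mlb.fit_transform(train_occasions)
Y_test = mlb.transform(test_occasions)

# One independent binary classifier per occasion on top of TF-IDF features.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(train_texts, Y_train)
Y_pred = model.predict(test_texts)

# Multi-label metrics: fraction of wrong label slots, micro-averaged F1,
# and one 2x2 confusion matrix per occasion.
hl = hamming_loss(Y_test, Y_pred)
f1 = f1_score(Y_test, Y_pred, average="micro", zero_division=0)
mcm = multilabel_confusion_matrix(Y_test, Y_pred)

print(mlb.inverse_transform(Y_pred))
print(hl, f1)
print(mcm)
```

On real data the same metrics would be computed on the test split from `train_test_split`; the scores from a toy set this small carry no meaning.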

0 answers:

No answers yet