Question

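I am using an e-commerce dataset to predict product categories. I use the product description and the supplier code as features and predict the product category.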

from sklearn import preprocessing
from sklearn import model_selection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import ensemble

df['joined_features'] = df['description'].astype(str) + ' ' + df['supplier'].astype(str) 

# split the dataset into training and validation datasets 
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['joined_features'], df['category'])

# encode target variable 
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)  # reuse the label mapping fitted on train_y

# count vectorizer object, fitted on the training split only to avoid leaking validation vocabulary
count_vect = CountVectorizer(analyzer='word')
count_vect.fit(train_x)

# transform training and validation data
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

classifier = ensemble.RandomForestClassifier()
classifier.fit(xtrain_count, train_y)
predictions = classifier.predict(xvalid_count)

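With this prediction I get about 90% accuracy. I now want to predict more categories. The categories are hierarchical, and the one I predicted is the main (top-level) category. I want to predict the levels below it as well.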

For example, I predicted clothes. Now I want to predict: clothes -> shoes

I tried joining the two categories, df['category1'] + df['category2'], and predicting them as a single label, but I get around 2% accuracy, which is really low.

What is the proper way to build a classifier in a hierarchical fashion?

Edit: For better understanding, I put together some fake data:

Reading the first row: category 1 corresponds to Samsung, category 3 to electronics, and category 7 to TV.

Answer 1

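One idea could be to build a model over all level-2 categories, but feed the predicted probabilities for category1 into that model as additional input features.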
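A minimal sketch of this first idea, assuming scikit-learn and SciPy are available. The toy `texts`, `cat1`, and `cat2` values and the helper `predict_level2` are made up for illustration and are not taken from the question's data. The level-1 probabilities used for training are produced out-of-fold with `cross_val_predict`, so the level-2 model does not see probabilities that the level-1 model produced on its own training rows.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict

# Hypothetical toy data: description + supplier, parent category, child category
texts = [
    "leather running shoes nike",
    "canvas sneakers shoes adidas",
    "cotton t-shirt nike",
    "linen shirt long sleeve adidas",
    "55 inch led tv samsung",
    "oled smart tv lg",
    "android smartphone samsung",
    "smartphone 128gb lg",
]
cat1 = ["clothes"] * 4 + ["electronics"] * 4
cat2 = ["shoes", "shoes", "shirts", "shirts", "tv", "tv", "phone", "phone"]

vect = CountVectorizer()
X = vect.fit_transform(texts)

# Level 1: out-of-fold class probabilities for the training rows
level1 = RandomForestClassifier(n_estimators=50, random_state=0)
proba1 = cross_val_predict(level1, X, cat1, cv=2, method="predict_proba")
level1.fit(X, cat1)  # refit on all rows for use at prediction time

# Level 2: bag-of-words features plus the level-1 probabilities
X2 = hstack([X, csr_matrix(proba1)]).tocsr()
level2 = RandomForestClassifier(n_estimators=50, random_state=0)
level2.fit(X2, cat2)

def predict_level2(new_texts):
    """Predict the child category, using level-1 probabilities as extra features."""
    Xn = vect.transform(new_texts)
    p1 = level1.predict_proba(Xn)
    return level2.predict(hstack([Xn, csr_matrix(p1)]).tocsr())

preds = predict_level2(["nike leather shoes", "samsung led tv"])
```

Note that nothing here forces the predicted child to belong to the predicted parent; the parent probabilities are only a hint to the level-2 model.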

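Another idea is to train the category2 model only on the rows where category1 == clothes. Ideally you would then have several multi-class models that are called conditionally, based on the category1 prediction. Obviously this increases the amount of work you have to do, depending on how many category1 values there are.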
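The conditional approach could be sketched roughly as below, again assuming scikit-learn. The toy data, the `child_models` dictionary, and the `predict_hierarchy` helper are hypothetical names chosen for this example. One top-level model predicts category1, and one separate model per category1 value predicts category2 among that parent's children only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy data: description + supplier, parent category, child category
texts = [
    "leather running shoes nike",
    "canvas sneakers shoes adidas",
    "cotton t-shirt nike",
    "linen shirt long sleeve adidas",
    "55 inch led tv samsung",
    "oled smart tv lg",
    "android smartphone samsung",
    "smartphone 128gb lg",
]
cat1 = np.array(["clothes"] * 4 + ["electronics"] * 4)
cat2 = np.array(["shoes", "shoes", "shirts", "shirts", "tv", "tv", "phone", "phone"])

vect = CountVectorizer()
X = vect.fit_transform(texts)

# Top-level model: predicts the parent category
top_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, cat1)

# One child model per parent, trained only on that parent's rows
child_models = {}
for parent in np.unique(cat1):
    rows = np.flatnonzero(cat1 == parent)
    children = np.unique(cat2[rows])
    if len(children) == 1:
        # Only one possible child: store the label instead of a model
        child_models[parent] = str(children[0])
    else:
        model = RandomForestClassifier(n_estimators=50, random_state=0)
        child_models[parent] = model.fit(X[rows], cat2[rows])

def predict_hierarchy(new_texts):
    """Return (parent, child) pairs; the child always belongs to the predicted parent."""
    Xn = vect.transform(new_texts)
    results = []
    for i, parent in enumerate(top_model.predict(Xn)):
        model = child_models[parent]
        child = model if isinstance(model, str) else model.predict(Xn[i])[0]
        results.append((str(parent), str(child)))
    return results

pairs = predict_hierarchy(["nike leather shoes", "samsung led tv"])
```

With this layout a mistake at the top level cannot be corrected further down, so the overall accuracy is bounded by the accuracy of the category1 model.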

Converting a multi-class classifier into a hierarchical multi-class classifier
