Format of train/test for random forest classifier with categorical variables

Date: 2019-01-15 18:26:10

Tags: python pandas scikit-learn

Updated: How do I set up my train/test DataFrames for scikit-learn's RandomForestClassifier when the target has multiple categories? How do I predict?

My training dataset has a categorical Outcome column with 4 classes, and I want to predict which of those four is most likely for each row of my test data. Following other questions, I tried using pandas get_dummies to encode the Outcome column as four new columns in the original df, but I wasn't sure how to tell the classifier that those four columns were the categories, so I used y = df_raw['Outcomes'].values instead.

I then split the training set 80/20 into X_train, X_valid and y_train, y_valid, and called fit() on the training portion:

def split_vals(a, n): return a[:n].copy(), a[n:].copy()
n_valid = 10000
n_trn = len(df_raw_dumtrain) - n_valid
raw_train, raw_valid = split_vals(df_raw_dumtrain, n_trn)
X_train, X_valid = split_vals(df_raw_dumtrain, n_trn)
y_train, y_valid = split_vals(y, n_trn)  # y = df_raw['Outcomes'].values, as described above

from sklearn.ensemble import RandomForestClassifier

random_forest = RandomForestClassifier(n_estimators=10)
random_forest.fit(X_train, y_train)
Y_prediction = random_forest.predict(X_train)

I then tried running predict() on the test set:

test_pred = random_forest.predict(df_test)

But I get an error:

ValueError: Number of features of the model must match the input. Model n_features is 27 and input n_features is 28

How should I be configuring my test set?

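1 answer:

Answer 0 (score: 0)

You have to remove the target variable from the test data and then pass the remaining columns of the dataframe as input to the predict function. That resolves the mismatch in the number of features.

Try this!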

random_forest.predict(df_test.drop('Outcomes',axis=1))
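A slightly more defensive variant of the same idea, as a minimal sketch: record the feature columns used at fit time and select exactly those columns, in the same order, from the test frame. The toy data, the column names age and weight, and the class labels A to D are hypothetical stand-ins, not the asker's real data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for the asker's train/test frames
df_train = pd.DataFrame({
    "age": [1, 5, 3, 8, 2, 7, 4, 6],
    "weight": [4.0, 20.5, 9.1, 30.2, 6.3, 25.0, 12.4, 18.8],
    "Outcomes": ["A", "B", "C", "D", "A", "B", "C", "D"],
})
df_test = pd.DataFrame({
    "age": [4, 6],
    "weight": [10.0, 22.0],
    "Outcomes": [None, None],  # extra column that was never a feature
})

# Remember which columns the model is trained on
feature_cols = [c for c in df_train.columns if c != "Outcomes"]
X_train = df_train[feature_cols]
y_train = df_train["Outcomes"]

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# Drop the target if present, then keep exactly the training features in
# the training order; a feature missing from the test frame is filled with 0
X_test = (
    df_test.drop(columns=["Outcomes"], errors="ignore")
           .reindex(columns=feature_cols, fill_value=0)
)
test_pred = model.predict(X_test)
print(test_pred)
```

Selecting by the saved column list also guards against the test frame carrying other extra columns, or dummy columns appearing in a different order than in training.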

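Note: you don't have to create dummy variables for the target variable when using random forest or any other decision-tree-based model.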
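To illustrate the note, here is a minimal sketch on synthetic data (the random features and the labels A to D are made up for illustration). The classifier is fit on a single column of class labels, and predict_proba returns one probability per class, with the column order given by classes_.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic features and a 4-class target stored as plain string labels
rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = np.array(["A", "B", "C", "D"] * 10)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, y)  # no get_dummies on the target

X_new = rng.rand(5, 3)
pred = clf.predict(X_new)         # most likely class for each row
proba = clf.predict_proba(X_new)  # one column per class, ordered as clf.classes_

print(clf.classes_)
print(pred)
print(proba.shape)
```

Dummy encoding can still be useful for categorical feature columns; the note applies only to the target.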