I am testing code like this.
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
import numpy as np
import pandas as pd  # needed for pd.read_csv below
import matplotlib.pyplot as plt
from tabulate import tabulate
# Seaborn for easier visualization
import seaborn as sns
# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)
# the model can only handle numeric values so filter out the rest
# data = df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes
df1 = df1.fillna(0)
#Prerequisites
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
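# Split train/test sets
y = df1.SalePrice  # target; must be defined before train_test_split uses it
# Drop the target along with 'index' so the model cannot see the answer
X = df1.drop(['index', 'SalePrice'], axis=1)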
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)
# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)
# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))
(pd.Series(model.feature_importances_, index=X.columns)
.nlargest(10)
.plot(kind='barh'))
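This works well on some sample data I found online. Now, instead of using the sale price as my y variable, I am trying to figure out how to get the model to make predictions such as Target = True or Target = False, or perhaps my approach is wrong altogether.
It is a bit confusing to me because of this line: df1 = df.select_dtypes(include=[np.number]). It keeps only the numeric columns, which makes sense for a RandomForestRegressor. I am just looking for guidance on how to handle non-numeric predictions here.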
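Answer 0 (score: 0)
You are dealing with a two-class (True, False) classification problem here. As a starting point, take a look at a simple logistic regression model.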
https://en.wikipedia.org/wiki/Logistic_regression
Since you are using sklearn, try:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
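As a rough illustration of the suggestion above, here is a minimal sketch of a True/False prediction with sklearn. It runs on synthetic data, and the column names (feat_a, feat_b, label_text, Target) are hypothetical stand-ins for whatever train.csv actually contains. One detail that may explain the confusion in the question: select_dtypes(include=[np.number]) does not keep boolean columns, so the target is taken from the original frame before filtering. The sketch also fits a RandomForestClassifier, which the question already imports, so the feature-importance step carries over.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for train.csv; every column name here is hypothetical.
rng = np.random.default_rng(42)
n = 300
df = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
    "label_text": rng.choice(["x", "y"], size=n),  # non-numeric, filtered out below
})
# Boolean target that depends on the features plus some noise.
noise = rng.normal(scale=0.5, size=n)
df["Target"] = (df["feat_a"] + 0.5 * df["feat_b"] + noise) > 0

# Take the target from the original frame: a boolean column is not matched
# by include=[np.number], so it would be absent from the numeric-only frame.
y = df["Target"].astype(int)  # True/False -> 1/0
X = df.select_dtypes(include=[np.number]).fillna(0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.33, stratify=y
)

# Simple baseline suggested in the answer.
log_reg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Classifier counterpart of the RandomForestRegressor used in the question.
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

log_reg_acc = log_reg.score(X_test, y_test)  # mean accuracy on the held-out rows
forest_acc = forest.score(X_test, y_test)
print(f"LogisticRegression accuracy: {log_reg_acc:.3f}")
print(f"RandomForestClassifier accuracy: {forest_acc:.3f}")

# Convert the 1/0 predictions back to True/False if that is the desired output.
predictions = forest.predict(X_test).astype(bool)

# Feature importances work the same way as with the regressor.
importances = pd.Series(forest.feature_importances_, index=X.columns).nlargest(10)
print(importances)
```

If the real target is stored as text rather than as booleans, it would need to be mapped to 0/1 first; how to do that depends on the actual values in the file.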