How do I predict a y variable based on multiple x variables?

Time: 2019-04-16 19:01:53

Tags: python python-3.x scikit-learn

I am testing code like this.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tabulate import tabulate
# Seaborn for easier visualization
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Load data
df = pd.read_csv('C:\\path_to_file\\train.csv')
df.shape
list(df)


# The model can only handle numeric values, so filter out the rest
# Alternative: df.select_dtypes(include=[np.number]).interpolate().dropna()
df1 = df.select_dtypes(include=[np.number])
df1.shape
list(df1)
df1.dtypes


# Replace missing values with 0
df1 = df1.fillna(0)


# Split train/test sets
y = df1.SalePrice
# Drop the target as well, so it is not used as a feature
X = df1.drop(['index', 'SalePrice'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=.33)


# Train model
clf = RandomForestRegressor(n_jobs=2, n_estimators=1000)
model = clf.fit(X_train, y_train)


# Feature Importance
headers = ['name', 'score']
values = sorted(zip(X_train.columns, model.feature_importances_), key=lambda x: x[1] * -1)
print(tabulate(values, headers, tablefmt='plain'))


(pd.Series(model.feature_importances_, index=X.columns)
   .nlargest(10)
   .plot(kind='barh'))


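This works fine on some sample data I found online. Now, instead of using SalePrice as my y variable, I am trying to figure out how to get the model to make predictions such as target = True or target = False. Or maybe my approach is wrong.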

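What confuses me a bit is this line: df1 = df.select_dtypes(include=[np.number]). It keeps only the numeric columns, which makes sense for a RandomForestRegressor. I am just looking for guidance on how to handle non-numeric predictions.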

1 answer:

Answer 0: (score: 0)

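You are dealing with a classification problem with 2 classes (True, False), not a regression problem. As a starting point, take a look at a simple logistic regression model.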

https://en.wikipedia.org/wiki/Logistic_regression
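As a rough illustration of the idea (a minimal sketch, not code from the linked article): logistic regression passes a weighted sum of the x variables through the sigmoid function to get a probability for the True class, then compares it to a threshold. The weights, bias, and sample rows below are made-up values, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(row, weights, bias):
    """Probability of the True class for one row of x values."""
    z = sum(w * x for w, x in zip(weights, row)) + bias
    return sigmoid(z)


def predict(row, weights, bias, threshold=0.5):
    """Boolean prediction: True when the probability reaches the threshold."""
    return predict_proba(row, weights, bias) >= threshold


# Hypothetical parameters; a fitted model would learn these from data.
weights = [2.0, -1.0]
bias = 0.0

rows = [
    [1.0, 2.0],  # weighted sum = 0
    [3.0, 1.0],  # weighted sum = 5
    [0.0, 4.0],  # weighted sum = -4
]

for row in rows:
    print(row, round(predict_proba(row, weights, bias), 4), predict(row, weights, bias))
```

In practice the weights and bias are estimated from the training data rather than chosen by hand.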

Since you are using sklearn, try:

https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
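A minimal sketch of how this could look, assuming the target is a boolean column. The data here is synthetic (generated with make_classification), and the column names x0..x4 and target are made up for illustration; with real data you would load your own DataFrame instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 5 numeric x columns and a
# boolean target column (all names here are assumptions).
features, labels = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=42
)
df = pd.DataFrame(features, columns=[f"x{i}" for i in range(5)])
df["target"] = labels.astype(bool)

# Features are every column except the target
X = df.drop(columns=["target"])
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.33
)

# Fit a classifier instead of a regressor
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)          # array of True / False
probabilities = model.predict_proba(X_test)  # columns follow model.classes_
accuracy = model.score(X_test, y_test)       # fraction of correct predictions

print(model.classes_)
print(predictions[:5])
print(f"accuracy: {accuracy:.3f}")
```

The features still need to be numeric, but the target does not: the classifier accepts True/False labels directly. If you prefer to stay with random forests, RandomForestClassifier from sklearn.ensemble has the same fit/predict interface and also exposes feature_importances_.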