How to create an R-style prediction model in Python using multiple categorical variables

Time: 2017-01-21 06:28:18

Tags: python r machine-learning model prediction

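Do you know how to create a prediction model for an ensemble method, specifically a classifier that accepts an R-style formula, like this: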

ded.fit(formula="X ~ Y + Z**2", data=fed)

Currently my code looks like this:

from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10,    
random_state=1)
model.fit(x_train, y_train)

You may ask why I need this:

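  1. I need this to add more variables, not just X and Y; I also need Z and P as well as Q and R.
  2. I need to observe and experiment the way I would in R: check whether adding an exponent, or multiplying or dividing a value by a specific variable, increases or decreases prediction accuracy, as in the formulas below:

    "X ~ Y + Z^2" or "X ~ Y + Z + (P*2) + Q**2"

  3. Any answer will be highly appreciated. Thanks in advance.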
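One way to approximate formula-style experimentation without a formula parser is to describe each right-hand-side term as a pandas expression and evaluate it with DataFrame.eval. The sketch below is only an illustration: the helper build_features, the term names, and the synthetic data (including the made-up target X) are assumptions, not part of the original question. Libraries such as patsy can parse real R-style formulas, but that is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def build_features(data, terms):
    """Evaluate each expression in `terms` against the columns of `data`.

    `terms` maps a new column name to a pandas expression string.
    (Hypothetical helper, written for this illustration only.)
    """
    return pd.DataFrame({name: data.eval(expr) for name, expr in terms.items()})


# Synthetic stand-in data; column names follow the formulas in the question.
rng = np.random.RandomState(1)
fed = pd.DataFrame(rng.randint(0, 100, size=(200, 4)), columns=list("YZPQ"))
fed["X"] = (fed["Y"] + fed["Z"] > 100).astype(int)  # made-up binary target

# Rough equivalent of "X ~ Y + Z + (P*2) + Q**2"
terms = {"Y": "Y", "Z": "Z", "P_times_2": "P * 2", "Q_squared": "Q ** 2"}
features = build_features(fed, terms)

model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=1)
model.fit(features, fed["X"])
print(features.head())
```

Changing the formula then amounts to editing the terms dictionary and refitting the model.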

2 answers:

Answer 0: (score: 2)

Here is how I would attempt it, using a made-up pandas df with three columns for your categorical variables and one column for your target {cat1, cat2, cat3, target}:

predictors = df[["cat1", "cat2", "cat3"]]
target = df["target"]

# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module has been removed)
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Let sklearn do the training/testing split.
# The return order is: X_train, X_test, y_train, y_test.
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.30)

# Create the model with pre-pruning; play with the parameters, consulting the documentation.
numtrees = 50
classifier = RandomForestClassifier(n_estimators=numtrees, min_samples_leaf=10,
                                    max_leaf_nodes=25)
model = classifier.fit(pred_train, tar_train)
predictions = model.predict(pred_test)

# Test the results.
import sklearn.metrics

print('\n********* confusion matrix **********\n')
print("TRUE NEG   FALSE POS")
print(sklearn.metrics.confusion_matrix(tar_test, predictions))
print("FALSE NEG   TRUE POS")

print('\n============ Accuracy =============')
print(sklearn.metrics.accuracy_score(tar_test, predictions))

Keep in mind that I am not an experienced programmer, but the code above worked for me.
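The code above assumes that cat1, cat2 and cat3 already hold numeric values. If they hold string labels, scikit-learn's RandomForestClassifier cannot consume them directly, and one common option is one-hot encoding with pandas.get_dummies. The frame below is a hypothetical example invented for illustration; only the column names come from the answer.

```python
import pandas as pd

# Hypothetical frame with string-valued categorical columns.
df = pd.DataFrame({
    "cat1": ["a", "b", "a", "c"],
    "cat2": ["x", "x", "y", "y"],
    "cat3": ["m", "n", "m", "n"],
    "target": [0, 1, 0, 1],
})

# One indicator column per category level, named "<column>_<level>".
predictors = pd.get_dummies(df[["cat1", "cat2", "cat3"]])
target = df["target"]

print(predictors.columns.tolist())
```

The resulting predictors frame can be passed to train_test_split and fit in place of the raw columns.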

Answer 1: (score: 1)

The following should work:

import pandas as pd
import numpy as np
X = pd.DataFrame(np.random.randint(0,100,size=(100, 2)), columns=list('XZ'))
y = np.random.randint(2,size=100) # labels for binary classification
X['Z2'] = X.Z**2    # add more features
print(X.head()) # note the added feature Z^2
#    X   Z    Z2
#0  88  90  8100
#1  49  63  3969
#2  27  23   529
#3  47  71  5041
#4  21  98  9604
train_samples = 80  # Samples used for training the models
X_train = X[:train_samples]
X_test = X[train_samples:]
y_train = y[:train_samples]
y_test = y[train_samples:]
from sklearn.ensemble import RandomForestClassifier
from pandas_ml import ConfusionMatrix
import matplotlib.pyplot as plt
model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# print(confusion_matrix(y_test, y_pred))
cm = ConfusionMatrix(y_test, y_pred)
print(cm)
# Predicted  0   1  __all__
# Actual
# 0          3   4        7
# 1          4   9       13
# __all__    7  13       20
cm.plot()
plt.show()
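To judge whether an added feature such as Z2 actually helps, one option is to compare cross-validated accuracy with and without it. This is a minimal sketch under the same assumptions as the answer (random synthetic data, so both scores should sit near chance level); the helper mean_cv_accuracy is a hypothetical name introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.randint(0, 100, size=(200, 2)), columns=list("XZ"))
y = rng.randint(2, size=200)  # random labels, as in the answer above
X["Z2"] = X["Z"] ** 2         # the engineered feature under test


def mean_cv_accuracy(columns):
    """Mean 5-fold cross-validated accuracy using only the given columns."""
    model = RandomForestClassifier(n_estimators=100, min_samples_leaf=10, random_state=1)
    return cross_val_score(model, X[columns], y, cv=5, scoring="accuracy").mean()


baseline = mean_cv_accuracy(["X", "Z"])
with_square = mean_cv_accuracy(["X", "Z", "Z2"])
print("X, Z:     %.3f" % baseline)
print("X, Z, Z2: %.3f" % with_square)
```

Note that tree-based models split on thresholds, so a monotonic transform of a single feature (squaring a non-negative Z, or multiplying P by 2) usually changes the result very little; terms that combine several variables tend to be the more informative experiments.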
