Question

Using R, it is possible to ignore a variable (column) while building a model with the following syntax:

model = lm(dependant.variable ~ . - ignored.variable, data=my.training.set)

It is very convenient when your data set contains indexes or IDs.

How would you do that with SKlearn in python, assuming your data are Pandas data frames?

Answer 1

This is from my own code, which I used for some prediction work on StackOverflow:

import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

basic_feature_names = [ 'BodyLength'
                      , 'NumTags'
                      , 'OwnerUndeletedAnswerCountAtPostTime'
                      , 'ReputationAtPostCreation'
                      , 'TitleLength'
                      , 'UserAge' ]

num_estimators = 100  # placeholder; the original value was not shown
fea = ...  # extract the features - removed for brevity
# construct our classifier
clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)
# now fit, using only the selected feature columns
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
# now load the test data
priv_fea = ...  # this was my test dataset
# now calculate the predicted classes
pred = clf.predict(priv_fea[basic_feature_names])

So if I wanted to classify using only a subset of the features, I could have done this:

# want to train using fewer features so remove 'BodyLength'
basic_feature_names.remove('BodyLength')

clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)

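So the idea here is that a list can be used to select a subset of the columns in the pandas dataframe. We can therefore construct a new list, or remove a value from the existing one, and use it for selection.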
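To get closer to the R formula `y ~ . - ignored`, one option is to start from all columns and drop the unwanted ones rather than listing the wanted ones. The sketch below is only an illustration: the frame `df`, its column names, and the choice of `LinearRegression` are made up for the example and are not part of the code above.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 'id' is an index-like column we do not want as a feature.
df = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "x1": [1.0, 2.0, 3.0, 4.0],
    "x2": [0.5, 0.1, 0.9, 0.3],
    "y":  [2.0, 4.1, 6.2, 7.9],
})

target = "y"
ignored = ["id"]

# Keep every column except the target and the ignored ones.
feature_names = [c for c in df.columns if c != target and c not in ignored]

# Equivalent selection with DataFrame.drop, which returns a new frame
# and leaves df itself untouched.
X = df.drop(columns=[target] + ignored)

model = LinearRegression().fit(X, df[target])
```

Either `df[feature_names]` or `df.drop(columns=...)` can be passed to `fit`; the list form is handy when the same selection has to be applied to the test set later.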

I'm not sure how you could do this easily with numpy arrays, as indexing is done differently.
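For what it is worth, one possible approach with a plain numpy array is to select by column position instead of by name. This is a sketch under assumptions: the matrix `X` and the idea that column 0 holds an ID are invented for the example.

```python
import numpy as np

# Hypothetical feature matrix: column 0 plays the role of an ID column.
X = np.array([
    [101, 1.0, 0.5],
    [102, 2.0, 0.1],
    [103, 3.0, 0.9],
])

ignored_cols = [0]

# np.delete returns a copy with the given columns removed (axis=1 means columns).
X_dropped = np.delete(X, ignored_cols, axis=1)

# Alternatively, build a boolean mask over the columns and index with it.
keep = np.ones(X.shape[1], dtype=bool)
keep[ignored_cols] = False
X_masked = X[:, keep]
```

The drawback compared with the dataframe version is that positions carry no names, so the mapping from column index to feature has to be tracked separately.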

Ignoring a column while building a model with SKLearn
