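I'm working with a small dataset of 5 variables and ~90k observations. I tried fitting a random forest classifier modeled on the iris example at http://blog.yhathq.com/posts/random-forests-in-python.html. My problem is that the predicted values are all the same: 0. I'm new to Python but familiar with R. I'm not sure whether this is a coding error or whether it means my data is garbage.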
from sklearn.ensemble import RandomForestClassifier
data = train_df[cols_to_keep]
data = data.join(dummySubTypes.iloc[:, 1:])  # .ix was removed from pandas; .iloc is positional
data = data.join(dummyLicenseTypes.iloc[:, 1:])
data['is_train'] = np.random.uniform(0, 1, len(data)) <= .75
#data['type'] = pd.Categorical.from_codes(data['type'],["Type1","Type2"])
data.head()
Mytrain, Mytest = data[data['is_train']==True], data[data['is_train']==False]
Myfeatures = data.columns[1:5] # string of feature names: subtype dummy variables
rf = RandomForestClassifier(n_jobs=2)
y, _ = pd.factorize(Mytrain['type'])
rf.fit(Mytrain[Myfeatures], y)
data.target_names = np.asarray(list(set(data['type'])))
preds = data.target_names[rf.predict(Mytest[Myfeatures])]
The predictions are all a single class, Type1:
In[583]: pd.crosstab(Mytest['type'], preds, rownames=['actual'], colnames=['preds'])
Out[582]:
preds Type1
actual
Type1 17818
Type2 7247
Update: the first few rows of the data:
In[670]: Mytrain[Myfeatures].head()
Out[669]:
   subtype_INDUSTRIAL  subtype_INSTITUTIONAL  subtype_MULTIFAMILY  \
0                   0                      0                    0
1                   0                      0                    0
2                   0                      0                    0
3                   0                      0                    0
4                   0                      0                    0
   subtype_SINGLE FAMILY / DUPLEX
0                               0
1                               0
2                               0
3                               1
4                               1
When I predict on the training inputs, I also get predictions for only one class:
In[675]: np.bincount(rf.predict(Mytrain[Myfeatures]))
Out[674]: array([ 0, 75091])
Answer 0 (score: 3)
There are a few problems with your code, but the most obvious one is this:
data.target_names = np.asarray(list(set(data['type'])))
preds = data.target_names[rf.predict(Mytest[Myfeatures])]
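Sets in Python are inherently unordered, so there is no guarantee that the order of data.target_names matches the integer codes produced by pd.factorize. After this operation the predictions may be mislabeled.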
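To illustrate the labeling issue, here is a minimal sketch on a made-up label column (the `labels` Series is an assumption for illustration, not the asker's data). `pd.factorize` returns both the integer codes and the matching `uniques`, so indexing `uniques` with the codes recovers the original labels, whereas a `set` of the labels carries no such correspondence.

```python
import numpy as np
import pandas as pd

# Hypothetical label column, standing in for Mytrain['type'].
labels = pd.Series(["Type2", "Type1", "Type2", "Type1", "Type1"])

# factorize assigns codes in order of first appearance and returns
# the uniques in that same order.
codes, uniques = pd.factorize(labels)

# Order-safe decoding: index the uniques returned by factorize,
# not an array built from set(labels).
decoded = np.asarray(uniques)[codes]
print(list(codes), list(uniques), list(decoded))
```

If you do want to factorize by hand, keeping the `uniques` array and decoding predictions with it avoids relying on set ordering.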
Here is a cleaned-up version of your code:
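import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# build your data
data = train_df[cols_to_keep]
data = data.join(dummySubTypes.iloc[:, 1:])
data = data.join(dummyLicenseTypes.iloc[:, 1:])
# split into training/testing sets
# (train_test_split now lives in sklearn.model_selection;
# the old sklearn.cross_validation module has been removed)
train, test = train_test_split(data, train_size=0.75)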
# fit the classifier; scikit-learn factorizes labels internally
features = data.columns[1:5]
target = 'type'
rf = RandomForestClassifier(n_jobs=2)
rf.fit(train[features], train[target])
# predict and compute confusion matrix
preds = rf.predict(test[features])
print(pd.crosstab(test[target], preds,
                  rownames=['actual'],
                  colnames=['preds']))
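The comment above about scikit-learn handling labels internally can be checked on a toy example. This is only a sketch: the arrays `X` and `y` are made up, and `n_estimators=10` and `random_state=0` are arbitrary choices. After fitting on string labels, the classifier stores the sorted unique labels in `classes_`, and `predict` returns the original strings, so no manual decoding step is needed.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Made-up binary features and string labels (illustrative only).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 10)
y = np.array(["Type2", "Type1", "Type1", "Type2"] * 10)

rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(X, y)  # string labels are accepted directly

classes = list(rf.classes_)  # sorted unique labels seen during fit
preds = rf.predict(X)        # predictions come back as label strings
print(classes, preds[:4])
```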
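If the results still aren't what you expect, I'd suggest doing some hyperparameter optimization on the random forest with scikit-learn's grid search tools (GridSearchCV in current releases).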
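As a rough sketch of what that search could look like, the example below runs `GridSearchCV` over a small grid on synthetic data. The data from `make_classification` and the specific grid values are assumptions chosen only to keep the example self-contained and fast; they are not tuned for the asker's dataset. `class_weight` is included in the grid because it may matter when one class dominates, as in the crosstab in the question.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: 5 features, binary target).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Illustrative grid; the values are arbitrary starting points.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [None, 3],
    "class_weight": [None, "balanced"],
}

# 3-fold cross-validated search over every combination in the grid.
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)

best_params = search.best_params_
print(best_params, search.best_score_)
```

With real data you would pass `train[features]` and `train[target]` to `search.fit` and evaluate the refit `search.best_estimator_` on the held-out test set.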