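I am experimenting with combining several sklearn classifiers in a VotingClassifier.
To test this, I have a DataFrame with a set of columns representing tool skills (values from 0 to 10 indicating how well a person knows each skill) and a "Fit to Job" column, which is the class variable. Example: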
import numpy as np
import pandas as pd
columns = ["Python", "Scikit-learn", "Pandas", "Fit to Job"]
total_mock_samples = 100
# Fill the dataframe with mock data (mockResults is defined further down).
# Rows are collected in a list because DataFrame.append was removed in pandas 2.0.
rows = [mockResults(columns, 'Fit to Job', good_values=i > total_mock_samples/2) for i in range(total_mock_samples)]
df = pd.DataFrame(rows, columns=columns)
#Output like:
print(np.array(df))
#[[1. 3. 6. 1.]
# [3. 2. 3. 0.]
# [1. 4. 0. 0.]
# ...
# [7. 8. 8. 1.]
# [8. 7. 9. 1.]]
Then I set up the ensemble classifier:
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import numpy as np
X = np.array(df[df.columns[:-1]])
y = np.array(df[df.columns[-1]])
rfc = RandomForestClassifier(n_estimators=10)
svc = SVC(kernel='linear')
knn = KNeighborsClassifier(n_neighbors=5)
nb = GaussianNB()
lr = LinearRegression()
ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc), ("Linear Reg.",lr)])
Finally, I try to evaluate it with cross-validation, like this:
cval_score = cross_val_score(ensemble, X, y, cv=10)
But I get the following error:
TypeError Traceback (most recent call last)
<ipython-input-13-f7c01fa872d2> in <module>
182 ensemble = VotingClassifier(estimators=[("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc), ("Linear Reg.",lr)])
183
--> 184 cval_score = cross_val_score(ensemble, X, y, cv=10)
[...]
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'
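I checked other answers, but they all deal with numpy data conversion. This error occurs inside the cross-validation step. I tried applying their solutions, without success.
I also tried changing the data types before computing the score, but that did not work either.
Maybe someone with a sharper eye can spot the problem.
Edit 01: mock results generator function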
import random

def mockResults(columns, result_column_name='Fit', min_value=0, max_value=10, good_values=False):
    # Build one mock row: high skill values and label 1.0 when good_values is True,
    # low skill values and label 0.0 otherwise.
    mock_res = {}
    for column in columns:
        mock_res[column] = 0
        if column == result_column_name:
            if good_values == True:
                mock_res[column] = float(1)
            else:
                mock_res[column] = float(0)
        elif good_values == True:
            mock_res[column] = float(random.randrange(int(max_value*0.7), max_value))
        else:
            mock_res[column] = float(random.randrange(min_value, int(max_value*0.5)))
    return mock_res
Answer 0 (score: 1)
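The error most likely comes from hard voting: VotingClassifier tallies the predicted labels for each sample with np.bincount, which only accepts integers, while LinearRegression.predict returns floats. The snippet below is a minimal sketch of that mechanism, not scikit-learn's actual code; hard_vote and the sample arrays are hypothetical names made up for illustration.

```python
import numpy as np

def hard_vote(predictions):
    """Return the most frequent label among one sample's predictions.

    Hypothetical helper: it mimics a majority vote built on np.bincount.
    """
    return int(np.argmax(np.bincount(predictions)))

# Integer labels, as classifiers produce them: the vote works.
int_votes = np.array([1, 0, 1, 1, 0])
int_result = hard_vote(int_votes)

# One float-valued predictor (e.g. a regressor) turns the whole row into floats.
float_votes = np.array([1.0, 0.0, 1.0, 0.7, 0.2])
try:
    hard_vote(float_votes)
    float_vote_error = None
except TypeError as exc:
    # np.bincount refuses to cast float64 to int64 under the 'safe' rule.
    float_vote_error = str(exc)

# Casting to int64 (which truncates 0.7 and 0.2 to 0) lets the vote run again.
cast_result = hard_vote(float_votes.astype(np.int64))

print(int_result, float_vote_error, cast_result)
```

This is why the subclass in this answer casts its predictions to np.int64. Keep in mind that the cast truncates, so a regression output of 0.7 is counted as class 0.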
df = pd.DataFrame(columns=["Python", "Scikit-learn", "Pandas", "Fit to Job"], data=np.random.randint(1, 10,size=(400,4)))
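class LinearRegressionInt(LinearRegression):
    # LinearRegression predicts floats; cast them to integers so that
    # hard voting can count them as class labels.
    def predict(self, X):
        predictions = self._decision_function(X)
        return np.asarray(predictions, dtype=np.int64).ravel()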
...
lr = LinearRegressionInt()
...
ensemble = VotingClassifier(estimators=[("lr",lr),("Random forest", rfc), ("KNN",knn), ("Naive Bayes", nb), ("SVC",svc)] )
cval_score = cross_val_score(ensemble, X, y, cv=10)
cval_score
array([ 0.09090909, 0.11904762, 0.17073171, 0.14634146, 0.17073171,
0.15384615, 0.07692308, 0.15384615, 0.10810811, 0.08108108])
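An alternative that avoids subclassing is to replace the regressor with a real classifier such as LogisticRegression, which already predicts class labels. As far as I know, recent scikit-learn releases also reject regressors inside a VotingClassifier, so this may be the safer route. The sketch below assumes two well-separated groups similar to the mock data in the question; the generated arrays and the estimator names are illustrative assumptions, not the asker's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical stand-in for the mock data:
# 50 low-skill rows (label 0) and 50 high-skill rows (label 1).
X_low = rng.integers(0, 5, size=(50, 3))
X_high = rng.integers(7, 10, size=(50, 3))
X = np.vstack([X_low, X_high]).astype(float)
y = np.array([0] * 50 + [1] * 50)

# Every member is a classifier, so hard voting only ever sees class labels.
ensemble = VotingClassifier(estimators=[
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
    ("nb", GaussianNB()),
    ("svc", SVC(kernel="linear")),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# One accuracy score per fold.
cval_score = cross_val_score(ensemble, X, y, cv=10)
print(cval_score)
```

With an integer cv and a classifier, cross_val_score stratifies the folds, so each fold contains both classes.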