I have a credit scoring dataset and need to classify whether a customer will default.
LIMIT_BAL gender EDUCATION MARRIAGE AGE SEP_STATUS AUG_STATUS JUL_STATUS JUN_STATUS MAY_STATUS ... JUN_BAL MAY_BAL APR_BAL SEP_PAID AUG_PAID JUL_PAID JUN_PAID MAY_PAID APR_PAID default_0
0 20000 female bachelor married 24 2 mo 2 mo paid paid no need to pay ... 0 0 0 0 689 0 0 0 0 bad
1 90000 female bachelor single 34 using credit using credit using credit using credit using credit ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 good
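from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dec_class = DecisionTreeClassifier(random_state=17)
y = df['default_0']   # target column: 'good' / 'bad'
x = df.iloc[:, :-1]   # every column except the target
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=17)
dec_class.fit(X_train, y_train)  # fit on the training split only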
could not convert string to float: 'female'
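I thought decision trees work well with both categorical and numerical features. I preprocessed the categorical features into words; previously they were all numeric codes. Why are categorical features not accepted when they are given as words, for example gender as "male" / "female"?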
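
For context, scikit-learn's DecisionTreeClassifier converts its input to a numeric array, which is why a string such as 'female' triggers the error. One common workaround is to encode the string columns before the tree sees them. The following is a minimal sketch, not a definitive solution: the toy DataFrame is hypothetical stand-in data that only mimics a few of the columns shown above, and one-hot encoding is an assumed choice (ordinal encoding is another option for ordered categories such as education level).

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in rows; the real dataset has many more columns.
df = pd.DataFrame({
    "LIMIT_BAL": [20000, 90000, 50000, 120000, 30000, 70000],
    "gender": ["female", "female", "male", "male", "female", "male"],
    "MARRIAGE": ["married", "single", "single", "married", "single", "married"],
    "AGE": [24, 34, 37, 29, 45, 52],
    "default_0": ["bad", "good", "good", "bad", "good", "bad"],
})

X = df.drop(columns="default_0")
y = df["default_0"]

# Treat every non-numeric column as categorical.
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

# One-hot encode the categorical columns; pass numeric columns through unchanged.
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    remainder="passthrough",
)

# Chaining the encoder and the tree means the same encoding is applied
# at fit time and at predict time.
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=17)),
])

model.fit(X, y)               # string columns no longer reach the tree directly
predictions = model.predict(X)
```

With the real data, the pipeline would be fitted on X_train and y_train and evaluated on X_test. Keeping the encoder inside the pipeline avoids fitting it on the test rows.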