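In the following example from the documentation of the "scorecard" package, all variables are binned. However, if I look at the proposed binning for "age.in.years", the default rate as a function of age follows a roller-coaster pattern (you can see this in the plot or in the "badprob" column). Can we impose the condition that the default rate decreases as age increases (that is, a weight of evidence that is monotonic in age), while choosing the bins so that the information value of the binning is maximized? Any ideas?
Many thanks.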
library(scorecard)
# data preparing ------
# load germancredit data
data("germancredit")
# filter variable via missing rate, iv, identical value rate
dt_f = var_filter(germancredit, y="creditability")
# breaking dt into train and test
dt_list = split_df(dt_f, y="creditability", ratio = 0.6, seed = 30)
label_list = lapply(dt_list, function(x) x$creditability)
# woe binning ------
bins = woebin(dt_f, y="creditability")
> bins$age.in.years
variable bin count count_distr good bad badprob woe
1: age.in.years [-Inf,26) 190 0.190 110 80 0.4210526 0.5288441
2: age.in.years [26,28) 101 0.101 74 27 0.2673267 -0.1609304
3: age.in.years [28,35) 257 0.257 172 85 0.3307393 0.1424546
4: age.in.years [35,37) 79 0.079 67 12 0.1518987 -0.8724881
5: age.in.years [37, Inf) 373 0.373 277 96 0.2573727 -0.2123715
bin_iv total_iv breaks is_special_values
1: 0.057921024 0.1304985 26 FALSE
2: 0.002528906 0.1304985 28 FALSE
3: 0.005359008 0.1304985 35 FALSE
4: 0.048610052 0.1304985 37 FALSE
5: 0.016079553 0.1304985 Inf FALSE
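
One possible starting point, offered as a rough sketch rather than a feature of the scorecard package: merge adjacent bins until the bad rate no longer rises with age (a pool-adjacent-violators style heuristic), then recompute WoE and IV. The helpers `woe_iv` and `merge_monotone` below are hypothetical names introduced for illustration, the counts are copied by hand from the output above, and the WoE convention (log of bad share over good share) is the one that reproduces the printed `woe` column. The merge only guarantees monotonicity; it does not guarantee that the information value is maximal among all monotone binnings.

```r
# Counts copied from the bins$age.in.years output above (bins ordered by age).
good  <- c(110, 74, 172, 67, 277)
bad   <- c(80, 27, 85, 12, 96)
upper <- c(26, 28, 35, 37, Inf)   # right edge of each bin

# WoE and IV in the convention that matches the printed output:
# woe = log(bad share / good share), bin_iv = (bad share - good share) * woe.
woe_iv <- function(good, bad) {
  g <- good / sum(good)
  b <- bad / sum(bad)
  woe <- log(b / g)
  list(woe = woe, iv = sum((b - g) * woe))
}

# Hypothetical helper (not part of scorecard): repeatedly merge the first pair
# of adjacent bins where the older bin has the higher bad rate, until the bad
# rate is non-increasing with age.
merge_monotone <- function(good, bad, upper) {
  repeat {
    rate <- bad / (good + bad)
    viol <- which(diff(rate) > 0)
    if (length(viol) == 0) break
    i <- viol[1]
    good[i]  <- good[i] + good[i + 1]
    bad[i]   <- bad[i] + bad[i + 1]
    upper[i] <- upper[i + 1]
    good  <- good[-(i + 1)]
    bad   <- bad[-(i + 1)]
    upper <- upper[-(i + 1)]
  }
  list(good = good, bad = bad, upper = upper)
}

res <- merge_monotone(good, bad, upper)
monotone_breaks <- res$upper[is.finite(res$upper)]

print(monotone_breaks)                     # finite breaks: 26 and 35
print(res$bad / (res$good + res$bad))      # bad rate per merged bin
print(woe_iv(good, bad)$iv)                # IV of the original five bins
print(woe_iv(res$good, res$bad)$iv)        # IV after merging (lower)

# Assumption: the resulting breaks could then be passed back to woebin through
# its breaks_list argument, for example:
# bins_mono = woebin(dt_f, y = "creditability",
#                    breaks_list = list(age.in.years = monotone_breaks))
```

With the counts above, the merge collapses [26,28) with [28,35) and [35,37) with [37, Inf), leaving three bins. The price of monotonicity is a lower information value than the unconstrained five-bin split, so it is worth comparing the two IV figures before settling on the coarser binning.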