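I am working on a project that implements a two-step procedure for predicting whether an individual will repay a loan. The project is meant to introduce us to retail credit risk and everyday personal borrowing, such as credit cards.
The two steps are as follows:
1. Run a multivariate logistic regression on the "resolved" cases, that is, the observations whose outcome is known, with the dependent variable equal to 1 for "cure" and 0 for "liquidation". In this section, I use factors
2. This gives me a model built on past information about whether individuals actually managed to clear their debt or declared bankruptcy. I then apply that model to individuals who are currently unable to repay their loans.
The premise is to supplement the resolved cases with the unresolved ones, so that each unresolved case carries a probability of repaying ("cure").
My input table now looks like this: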
Resolution_status  dependent_var  weight  X1  X2    X3  X4  X5
Resolved           1              1       30  1500  3   3
Resolved           0              1       15  750   1   1
----------------------------------------------------------------
Unresolved         1              0.6     5   500   6   6
Unresolved         0              0.4     5   500   6   6
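I split out the unresolved cases so that every observation follows these rules:
- each unresolved observation is duplicated;
- the first copy is given a "1" (cure), with a weight equal to the probability of cure estimated by the model from step 1;
- the second copy is given a "0" (liquidation), with the complementary weight, as in the last two rows of the table above.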
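The duplication rule can be sketched in a single DATA step. This is only an illustration under assumptions: `scored_toy`, `id` and `p_cure` are made-up names standing in for the scored unresolved table and its predicted probability of cure, and the two probabilities are arbitrary.

```sas
/* Hypothetical scored table: one row per unresolved case,
   p_cure = probability of cure predicted in step 1 (made-up values) */
DATA scored_toy;
    INPUT id p_cure;
    DATALINES;
1 0.6
2 0.25
;
RUN;

/* Emit two weighted copies of every unresolved case */
DATA duplicated_toy;
    SET scored_toy;
    flag = 1; weight = p_cure;     OUTPUT;  /* "cure" copy */
    flag = 0; weight = 1 - p_cure; OUTPUT;  /* "liquidation" copy */
    DROP p_cure;
RUN;
```

Each input row produces two output rows whose weights sum to 1, so an unresolved case never counts for more than one resolved case.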
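What effect does the WEIGHT statement have here? Should I be using a zero-one-inflated beta regression or a fractional logit model instead?
I have tried to reproduce the example above with the sashelp.baseball dataset, so that you can run it: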
/* Split the dataset into resolved and unresolved cases */
DATA baseball_resolved
     baseball_unresolved;
    SET sashelp.baseball (KEEP = cr: logsalary);
    /* flag stays missing for rows without a salary (the "unresolved" cases) */
    IF NOT MISSING(logsalary) THEN DO;
        IF logsalary > 6.5 THEN flag = 1;
        ELSE flag = 0;
    END;
    IF NOT MISSING(logsalary) THEN OUTPUT baseball_resolved;
    ELSE OUTPUT baseball_unresolved;
    DROP logSalary;
RUN;
/* Fit the model on the resolved cases */
PROC LOGISTIC DESCENDING
        OUTMODEL = in_model_baseball
        DATA = baseball_resolved
        PLOTS(ONLY) = NONE;
    MODEL flag (EVENT = '1') = cr:
        / SELECTION = NONE
          LINK = LOGIT;
RUN;
QUIT;
/* Apply the model to the unresolved cases */
PROC LOGISTIC INMODEL = in_model_baseball;
    SCORE DATA = baseball_unresolved
          OUT = unresolved_score (KEEP = cr: p_1 p_0);
RUN;
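/* Now output duplicate rows, with a weight attached */
/* "Cure" copy: flag = 1, weighted by the predicted probability of cure */
DATA unresolved_baseball_p_cure;
    SET unresolved_score (RENAME = (p_1 = weight));
    flag = 1;
    DROP p_0;
RUN;
/* "Non-cure" copy: flag = 0, weighted by the predicted probability of non-cure */
DATA unresolved_baseball_p_non_cure;
    SET unresolved_score (RENAME = (p_0 = weight));
    flag = 0;
    DROP p_1;
RUN;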
/* Attach a weight of 1 to all resolved cases */
DATA baseball_resolved_weight;
    SET baseball_resolved;
    weight = 1;
RUN;
/* Concatenate the three tables */
DATA full_table (RENAME = (weight = weight_var));
    SET baseball_resolved_weight
        unresolved_baseball_p_cure
        unresolved_baseball_p_non_cure;
RUN;
/* Run a logistic regression with weight */
PROC LOGISTIC DATA = full_table;
    MODEL flag (EVENT = '1') = cr:;
    WEIGHT weight_var;
RUN;
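Does the WEIGHT statement do what I want in this setting? My goal is essentially a logistic regression on the 1s and 0s, but one that takes into account that the "unresolved" cases are duplicated, with each copy carrying its "probability of cure" as a weight.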
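Answer 0 (score: 1)
The WEIGHT statement applies the weight to the entire row, that is, to the independent variables as well as the dependent variable.
For example, suppose your dataset contained only these four rows: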
Resolution_status  dependent_var  weight  X1  X2    X3  X4  X5
Resolved           1              1       30  1500  3   3
Resolved           0              1       15  750   1   1
Unresolved         1              0.6     5   500   6   6
Unresolved         0              0.4     5   500   6   6
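One way to look at it: although there are physically 4 rows, for the purposes of the weighted calculations the dataset is effectively treated as having 3 rows (sum of weights = 1 + 1 + 0.6 + 0.4 = 3).
So when you run PROC LOGISTIC on the 4-observation dataset above with 'weight' as the weight variable, you are technically fitting the logistic regression on:
3 observations, of which 1.6 observations have dependent_var = 1 and 1.4 observations have dependent_var = 0.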
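To make that a little more precise, here is a sketch of the weighted log-likelihood, assuming the weight simply multiplies each row's contribution. The symbols $w_i$, $y_i$, $p_i$ and $\pi$ are notation introduced here for illustration, not taken from the question.

```latex
\ell(\beta) \;=\; \sum_{i} w_i \Bigl[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \Bigr],
\qquad p_i = \frac{1}{1 + e^{-x_i^{\top}\beta}}

% A duplicated unresolved case (same covariates, hence same p) with
% weights \pi on the y = 1 copy and 1 - \pi on the y = 0 copy contributes
\pi \log p \;+\; (1 - \pi)\log(1 - p)
```

For the two unresolved rows in the table, $\pi = 0.6$, so the pair adds $0.6 \log p + 0.4 \log(1 - p)$. That is the same term a single row with a fractional response of 0.6 would add to a Bernoulli quasi-likelihood, which is why the weighted duplicates and a fractional logit are closely related as far as the point estimates go.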
The weight also applies implicitly to the independent variables (X1-X5). For example, the mean of X1 is no longer (30 + 15 + 5 + 5) / 4; it becomes (30 * 1 + 15 * 1 + 5 * 0.6 + 5 * 0.4) / 3.
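These figures can be checked directly in SAS. The following is a minimal sketch, assuming the four rows are typed in by hand. The dataset name `toy` is made up for illustration, and only X1 is entered because it is the only predictor used in the arithmetic above.

```sas
/* Hand-typed copy of the four example rows (X1 only) */
DATA toy;
    INPUT status :$10. dependent_var weight X1;
    DATALINES;
Resolved 1 1 30
Resolved 0 1 15
Unresolved 1 0.6 5
Unresolved 0 0.4 5
;
RUN;

/* Weighted counts of the response: the weights in each class are summed */
PROC FREQ DATA = toy;
    TABLES dependent_var;
    WEIGHT weight;
RUN;

/* Weighted mean of X1 together with the sum of the weights */
PROC MEANS DATA = toy MEAN SUMWGT;
    VAR X1;
    WEIGHT weight;
RUN;
```

The PROC FREQ table should show frequencies of 1.4 and 1.6 for dependent_var = 0 and 1, and PROC MEANS should report a sum of weights of 3 and a weighted mean of 50 / 3, roughly 16.67, instead of the unweighted 13.75.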
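Technically speaking, that is what the weight does. As for your premise and whether the approach itself is valid, I will not comment here, since that depends on an understanding of your case, on how comfortable you are with it, and on the assumptions made from a credit-risk perspective...
Hope this helps...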