Does the WEIGHT statement in PROC LOGISTIC affect the independent or the dependent variables?

Time: 2019-05-16 13:06:06

Tags: sas regression logistic-regression mixed-models

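I am running a project that implements a two-step process for predicting whether an individual will repay their loan. The project is meant to introduce us to retail credit risk and everyday personal borrowing, such as credit cards.

The two steps are as follows: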

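  1. Run a multivariate logistic regression on the "resolved" cases, that is, observations whose outcome is known: the dependent variable is 1 for "cure" and 0 for "liquidation". In this step I use the factors

    • Daily current account changes/volume
    • Total long-term debt
    • Credit utilization
    • Time since last payment
    • Time with the credit card provider
  2. I now have a model, informed by past data, of whether an individual managed to clear their debt or declared bankruptcy. I apply this model to individuals who are currently unable to repay their loans.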

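    The premise is to supplement the resolved cases with the unresolved ones. Each unresolved case therefore gets a probability of "repaying", i.e. "cure".

My input table now looks like this: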

Resolution_status dependent_var weight X1 X2 X3 X4 X5
Resolved          1             1      30 1500 3 3
Resolved          0             1      15 750  1 1
----------------------------------------------------------------
Unresolved        1             0.6    5  500  6 6
Unresolved        0             0.4    5  500  6 6 

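I split up the unresolved cases so that each observation follows these rules:

  - Each unresolved observation is duplicated.
  - The first copy is given a "1" for cure, with a weight equal to the probability of cure estimated by the model in step 1.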
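The duplication rule can be sketched as a single DATA step. This is only an illustration: the dataset `unresolved_scored`, the variable `p_cure`, and the sample row are hypothetical, and the weight `1 - p_cure` for the "0" copy is inferred from the 0.6 / 0.4 rows in the table above rather than stated in the rules.

```sas
/* Hypothetical scored input: one row per unresolved case,
   with p_cure = estimated probability of cure from step 1 */
data unresolved_scored;
    input id p_cure X1 X2;
    datalines;
1 0.6 5 500
;
run;

/* Emit two weighted rows per unresolved case */
data unresolved_expanded;
    set unresolved_scored;
    /* "cure" copy, weighted by the cure probability */
    dependent_var = 1; weight = p_cure;     output;
    /* "liquidation" copy, weighted by the complement (assumed) */
    dependent_var = 0; weight = 1 - p_cure; output;
    drop p_cure;
run;
```

With the sample row, `unresolved_expanded` would hold two rows that share the same X1 and X2 and carry weights 0.6 and 0.4.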

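What is the effect of using the WEIGHT statement? Should I instead use a zero-one inflated beta regression or a fractional logit model?

I tried to reproduce the example above with the SASHELP.BASEBALL dataset so that you can run it: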

      /*Split the dataset into resolved and unresolved*/
      DATA baseball_resolved
               baseball_unresolved
               ;
               SET sashelp.baseball
                         (KEEP = cr: logsalary);

               IF NOT MISSING(logsalary) THEN DO;
                         IF logsalary > 6.5 THEN flag = 1;
                         ELSE flag = 0;
               END;

               IF NOT MISSING(logsalary) THEN OUTPUT baseball_resolved;
               ELSE OUTPUT baseball_unresolved;

               DROP logSalary;
      RUN;

      /*Fit the model on the resolved cases*/
      PROC LOGISTIC DESCENDING
               OUTMODEL = in_model_baseball
               DATA = baseball_resolved
               PLOTS(ONLY) = NONE;
               MODEL flag (Event = '1') = cr:
               /
               SELECTION = NONE
               LINK = LOGIT
               ;
      RUN;
      QUIT;

      /*Apply the model to the unresolved cases*/
      PROC LOGISTIC
               INMODEL = in_model_baseball;
               SCORE DATA = baseball_unresolved
               OUT = unresolved_score
                         (KEEP = cr: p_1 p_0);
      RUN;

      /*Now output duplicate rows, with a weight attached*/
      DATA unresolved_baseball_p_cure;
               SET unresolved_score
                         (RENAME = (p_1 = weight));
               flag = 1;
               ;
               DROP p_0;
      RUN;

      DATA unresolved_baseball_p_non_cure;
               SET unresolved_score
                         (RENAME = (p_0 = weight));
               /*Non-cure copy: outcome 0, weighted by P(flag = 0)*/
               flag = 0;
               DROP p_1;
      RUN;

      /*Attach a weight of 1 to all resolved cases*/
      DATA baseball_resolved_weight;
               SET baseball_resolved;
               WEIGHT = 1;
      RUN;

      /*Merge the tables*/
      DATA full_table
               (rename = (weight = weight_var));
               SET
                         baseball_resolved_weight
                         unresolved_baseball_p_cure
                         unresolved_baseball_p_non_cure;
      RUN;

      /*Run a logistic regression with weight*/
      proc logistic
               data = full_table;
               model flag (EVENT = '1') = cr:;
               weight weight_var;
      RUN;

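Does the WEIGHT statement work the way I am trying to use it here? My goal is essentially a logistic regression on 1s and 0s, but one that accounts for the "unresolved" cases being duplicated with a "probability of cure" attached.

1 answer:

Answer 0: (score: 1)

The WEIGHT statement applies the weight to the entire row. It affects both the independent and the dependent variables.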

For example, if the dataset contained only these four rows,

Resolution_status dependent_var weight X1 X2 X3 X4 X5
Resolved          1             1      30 1500 3 3
Resolved          0             1      15 750  1 1
Unresolved        1             0.6    5  500  6 6
Unresolved        0             0.4    5  500  6 6 

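One way to look at it: although there are physically 4 rows, for all computational purposes the dataset is treated as 3 rows (sum of weights = 1 + 1 + 0.6 + 0.4 = 3).

So when you run PROC LOGISTIC on the 4-observation dataset above with 'weight' as the weight variable, you are technically building a logistic regression model on:

3 observations, of which 1.6 observations have dependent_var = 1 and 1.4 observations have dependent_var = 0.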
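To make the "1.6 versus 1.4 observations" picture concrete, here is the weighted log-likelihood written out. It assumes, as I understand the documented behavior, that PROC LOGISTIC uses each weight as a multiplier on that row's log-likelihood contribution; the symbols $w_i$, $y_i$, $x_i$, $\pi_i$ and $\beta$ are notation introduced here, not taken from the question.

```latex
\ell(\beta) = \sum_{i} w_i \left[ y_i \log \pi_i + (1 - y_i) \log (1 - \pi_i) \right],
\qquad
\pi_i = \frac{1}{1 + e^{-x_i^{\top} \beta}}

% Total weight on each outcome in the four-row example:
\sum_{i:\, y_i = 1} w_i = 1 + 0.6 = 1.6,
\qquad
\sum_{i:\, y_i = 0} w_i = 1 + 0.4 = 1.4

% Contribution of the duplicated unresolved pair (same x, hence same \pi):
0.6 \log \pi + 0.4 \log (1 - \pi)
```

Under that assumption, the duplicated pair contributes the same term a single row with a fractional outcome of 0.6 would contribute, which is the form of objective a fractional logit model uses.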

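The weight implicitly applies to the independent variables (X1-X5) as well. For example, if you compute the mean of X1, it is no longer (30 + 15 + 5 + 5) / 4 but (30 * 1 + 15 * 1 + 5 * 0.6 + 5 * 0.4) / 3.

From a technical standpoint, that is what the weight does. However, I will not comment on your premise or on the validity of this approach, since that depends on an understanding of your case, your comfort level, and the assumptions made from a credit-risk perspective...

Hope this helps...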
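If you want to check these numbers yourself, a small sketch along the following lines should do it. The dataset name `four_rows` is made up for illustration, and X5 is left out because the rows in the example table only carry four X values.

```sas
/* The four example rows; X5 omitted (no values given in the table) */
data four_rows;
    input Resolution_status :$10. dependent_var weight X1 X2 X3 X4;
    datalines;
Resolved   1 1   30 1500 3 3
Resolved   0 1   15 750  1 1
Unresolved 1 0.6 5  500  6 6
Unresolved 0 0.4 5  500  6 6
;
run;

/* Weighted counts of the outcome: 1.6 for 1 and 1.4 for 0 */
proc freq data = four_rows;
    tables dependent_var;
    weight weight;
run;

/* Weighted mean of X1: sum(weight * X1) / sum(weight) = 50 / 3 */
proc means data = four_rows n sumwgt mean;
    var X1;
    weight weight;
run;
```

PROC MEANS should report N = 4 but a sum of weights of 3, and a weighted mean of about 16.67 rather than the unweighted 13.75.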