Generic scoring in R

Time: 2015-12-19 02:05:53

Tags: r conditional scoring

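I am new to R and moving over from SQL. I am trying to replace a SQL CASE expression with R code. At a high level, I have an input data frame and a reference table, and I create calculated columns based on the reference table.

Sample input data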

+------------+-----------+----+------------+-----+------+
| STUDENT_ID | UG_MAJOR  | C1 |     C2     | C3  |  C4  |
+------------+-----------+----+------------+-----+------+
|        123 | MATH      | A  | 8000-10000 | 12% | 9000 |
|        234 | ALL_OTHER | B  | 1500-2000  | 10% | 1500 |
|        345 | ALL_OTHER | A  | 2800-3000  | 8%  | 2300 |
|        456 | ALL_OTHER | A  | 8000-10000 | 12% | 3200 |
|        980 | ALL_OTHER | C  | 1000-2500  | 15% | 2700 |
+------------+-----------+----+------------+-----+------+

Reference data

> UG_MAJOR  REF_COL REF_VAL REF_SCORE
>     MATH  C1  A   10
>     MATH  C1  B   20
>     MATH  C1  C   30
>     MATH  C1  NULL    0
>     MATH  C1  MISSING 0
>     ALL_OTHER C1  A   20
>     ALL_OTHER C1  B   30
>     ALL_OTHER C1  C   40
>     ALL_OTHER C1  NULL    10
>     ALL_OTHER C1  MISSING 10
>     DEFAULT   C2  <1000   0
>     DEFAULT   C2  >1000   20
>     DEFAULT   C2  >7000   30
>     DEFAULT   C2  >9500   40
>     DEFAULT   C2  MISSING 0
>     DEFAULT   C2  NULL    0
>     DEFAULT   C3  <3% 5
>     DEFAULT   C3  >3% 10
>     DEFAULT   C3  >5% 100
>     DEFAULT   C3  >7% 200
>     DEFAULT   C3  >10%    300
>     DEFAULT   C3  NULL    0
>     DEFAULT   C3  MISSING 0
>     DEFAULT   C4  <5000   10
>     DEFAULT   C4  >5000   20
>     DEFAULT   C4  >10000  30
>     DEFAULT   C4  >15000  40

Expected output

+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| STUDENT_ID | UG_MAJOR  | C1 | C2         | C3  | C4   | C1_SCR | C2_SCR | C3_SCR | TOT_SCR |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| 123        | MATH      | A  | 8000-10000 | 12% | 9000 | 10     |        |        |         |
| 234        | ALL_OTHER | B  | 1500-2000  | 10% | 1500 | 20     |        |        |         |
| 345        | ALL_OTHER | A  | 2800-3000  | 8%  | 2300 | 10     |        |        |         |
| 456        | ALL_OTHER | A  | 8000-10000 | 12% | 3200 | 30     |        |        |         |
| 980        | ALL_OTHER | C  | 1000-2500  | 15% | 2700 | 40     |        |        |         |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+

The traditional SQL approach would be:

select student_id, 
UG_MAJOR, 
C1,
case 
when UG_MAJOR ='MATH' AND C1 IS NULL THEN 0
when UG_MAJOR ='MATH' AND C1 ='MISSING' THEN 0
when UG_MAJOR ='MATH' AND C1 ='A' THEN 10
when UG_MAJOR ='MATH' AND C1 ='B' THEN 20
when UG_MAJOR ='MATH' AND C1 ='C' THEN 30

when UG_MAJOR ='ALL_OTHER' AND C1 IS NULL THEN 0
when UG_MAJOR ='ALL_OTHER' AND C1 ='MISSING' THEN 0
when UG_MAJOR ='ALL_OTHER' AND C1 ='A' THEN 20
when UG_MAJOR ='ALL_OTHER' AND C1 ='B' THEN 30
when UG_MAJOR ='ALL_OTHER' AND C1 ='C' THEN 40

ELSE 'TBD' END AS C1_SCR,

C2,
CASE 
WHEN C2 IS NULL THEN 0
WHEN C2 ='Missing' OR C2 = . THEN 0
WHEN C2<=1000 THEN 0
WHEN C2 >1000 AND C2<=7000 THEN 20
WHEN C2 >7000 AND C2<=9500 THEN 30
WHEN C2 >9500 THEN 40
ELSE 'TBD' 
END AS C2_SCR

FROM REF_INPUT
GROUP BY 1,2,3,4,5,6

I would like to know whether there is an elegant way to handle this in R. Thanks, Par

1 answer:

Answer 0 (score: 1)

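I think your solution is a simple "join" between the input data table and the reference table on multiple columns, even though the SQL code in the question does not indicate this and instead shows a "hard-coded" reference table.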

Using the data.table package, the solution could be the last line of this code (the rest is only needed to create the data from the question):

library(data.table)

# your data
input <- setDT(structure(list(STUDENT_ID = c(123L, 234L, 345L, 456L, 980L), 
                        UG_MAJOR = c("MATH", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER"),
                        C1 = c("A", "B", "A", "A", "C"),
                        C2 = c("8000-10000", "1500-2000", "2800-3000", "8000-10000", "1000-2500"),
                        C3 = c("12%", "10%", "8%", "12%", "15%"),
                        C4 = c(9000L, 1500L, 2300L, 3200L, 2700L)),
                        .Names = c("STUDENT_ID", "UG_MAJOR", "C1", "C2", "C3", "C4"),
                        class = "data.frame", row.names = c(NA, -5L)))
input

# this is an incomplete list of your reference data (for demo purposes only)
refdata <-
  setDT(structure(
    list(
      UG_MAJOR = c(
        "MATH", "MATH", "MATH", "MATH", "MATH",
        "ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER"
      ), REF_COL = c("C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1",
                     "C1", "C1"), REF_VAL = c("A", "B", "C", "NULL", "MISSING", "A",
                                              "B", "C", "NULL", "MISSING"), REF_SCORE = c(10L, 20L, 30L, 0L,
                                                                                          0L, 20L, 30L, 40L, 10L, 10L)
    ), .Names = c("UG_MAJOR", "REF_COL",
                  "REF_VAL", "REF_SCORE"), class = "data.frame", row.names = c(NA,-10L)
  ))
refdata

# Join your data to the reference data table using multiple join columns and add a new column to input containing the score
input[refdata[REF_COL=="C1",], C1_SCR := REF_SCORE, on=c(UG_MAJOR="UG_MAJOR", C1="REF_VAL") ][]

Result:

   STUDENT_ID  UG_MAJOR C1         C2  C3   C4 C1_SCR
1:        123      MATH  A 8000-10000 12% 9000     10
2:        234 ALL_OTHER  B  1500-2000 10% 1500     30
3:        345 ALL_OTHER  A  2800-3000  8% 2300     20
4:        456 ALL_OTHER  A 8000-10000 12% 3200     20
5:        980 ALL_OTHER  C  1000-2500 15% 2700     40
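
For readers without data.table, the same multi-column lookup can be sketched in base R. This is an alternative illustration, not the method used in the answer: it builds a composite key from UG_MAJOR and C1 and uses `match()` to pull the score. The trimmed data frames and the names `ref_c1`, `input_key` and `ref_key` are assumptions made for this example only.

```r
# Trimmed copies of the question's data (only the columns needed for C1).
input <- data.frame(
  STUDENT_ID = c(123L, 234L, 345L, 456L, 980L),
  UG_MAJOR   = c("MATH", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER"),
  C1         = c("A", "B", "A", "A", "C"),
  stringsAsFactors = FALSE
)
refdata <- data.frame(
  UG_MAJOR  = rep(c("MATH", "ALL_OTHER"), each = 3),
  REF_COL   = "C1",
  REF_VAL   = rep(c("A", "B", "C"), times = 2),
  REF_SCORE = c(10L, 20L, 30L, 20L, 30L, 40L),
  stringsAsFactors = FALSE
)

# Keep only the reference rows that describe column C1.
ref_c1 <- refdata[refdata$REF_COL == "C1", ]

# Composite keys: one string per (major, value) pair on each side.
input_key <- paste(input$UG_MAJOR, input$C1, sep = "|")
ref_key   <- paste(ref_c1$UG_MAJOR, ref_c1$REF_VAL, sep = "|")

# match() returns the position of each input key in the reference keys
# (NA when there is no match), so unmatched rows get an NA score.
input$C1_SCR <- ref_c1$REF_SCORE[match(input_key, ref_key)]
input
```

Unlike `merge()`, this keeps the rows of `input` in their original order, which makes it easy to compare against the data.table result above.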

Open issues

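  • The result scores in your question seem to differ from mine (did you really use the reference data to create the result?)

  • Setting failed lookups to 0 (the value "zero") is not implemented (they will be NA, but NA can be replaced with 0 in a second step)

  • To create the other columns C2_SCR, C3_SCR and C4_SCR you have to apply the same logic (from the last line of code)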
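
The last two open issues can be sketched in base R. One caveat: the reference rows for C3 and C4 are thresholds (for example ">5000") rather than exact values, so an exact-match join may not find them; the sketch below uses `findInterval()` instead. The helper `score_by_threshold()` is hypothetical, the extra NA row is added only to show the NA-to-0 step, and treating a value that sits exactly on a threshold as belonging to the lower band is an assumption, since the reference table does not say. C2 holds ranges such as "8000-10000" and is left out because it is unclear how a range should be compared with the thresholds.

```r
# Hypothetical helper (not part of the original answer): score a numeric
# vector against ascending thresholds. `scores` needs one more entry than
# `breaks`: scores[1] applies at or below breaks[1], scores[2] above it, etc.
score_by_threshold <- function(x, breaks, scores, na_score = 0L) {
  stopifnot(length(scores) == length(breaks) + 1L)
  # left.open = TRUE makes every band (lower, upper], so a value equal to
  # a threshold falls into the lower band (an assumption, see above).
  idx <- findInterval(x, breaks, left.open = TRUE) + 1L
  out <- scores[idx]
  # Second step from the open issues: failed lookups (NA) become 0.
  out[is.na(out)] <- na_score
  out
}

# C3 and C4 from the question, plus one illustrative missing value each.
C3 <- c("12%", "10%", "8%", "12%", "15%", NA)
C4 <- c(9000L, 1500L, 2300L, 3200L, 2700L, NA)

# Strip the percent sign so C3 can be compared numerically.
c3_num <- as.numeric(sub("%", "", C3, fixed = TRUE))

# Thresholds and scores copied from the DEFAULT rows of the reference data.
C3_SCR <- score_by_threshold(c3_num, c(3, 5, 7, 10), c(5L, 10L, 100L, 200L, 300L))
C4_SCR <- score_by_threshold(C4, c(5000, 10000, 15000), c(10L, 20L, 30L, 40L))

data.frame(C3, C3_SCR, C4, C4_SCR)
```

In a real pipeline the breaks and scores would probably be derived from the reference table rather than typed in, but the parsing rules for entries such as ">10%" would then need to be agreed on first.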