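I am new to R and moving over from SQL. I am trying to replace a SQL CASE expression with R statements. At a high level, I have an input data frame and a reference table, and I create calculated columns based on the reference table.
Sample input data
+------------+-----------+----+------------+-----+------+
| STUDENT_ID | UG_MAJOR | C1 | C2 | C3 | C4 |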
+------------+-----------+----+------------+-----+------+
| 123 | MATH | A | 8000-10000 | 12% | 9000 |
| 234 | ALL_OTHER | B | 1500-2000 | 10% | 1500 |
| 345 | ALL_OTHER | A | 2800-3000 | 8% | 2300 |
| 456 | ALL_OTHER | A | 8000-10000 | 12% | 3200 |
| 980 | ALL_OTHER | C | 1000-2500 | 15% | 2700 |
+------------+-----------+----+------------+-----+------+
Reference data
> UG_MAJOR REF_COL REF_VAL REF_SCORE
> MATH C1 A 10
> MATH C1 B 20
> MATH C1 C 30
> MATH C1 NULL 0
> MATH C1 MISSING 0
> ALL_OTHER C1 A 20
> ALL_OTHER C1 B 30
> ALL_OTHER C1 C 40
> ALL_OTHER C1 NULL 10
> ALL_OTHER C1 MISSING 10
> DEFAULT C2 <1000 0
> DEFAULT C2 >1000 20
> DEFAULT C2 >7000 30
> DEFAULT C2 >9500 40
> DEFAULT C2 MISSING 0
> DEFAULT C2 NULL 0
> DEFAULT C3 <3% 5
> DEFAULT C3 >3% 10
> DEFAULT C3 >5% 100
> DEFAULT C3 >7% 200
> DEFAULT C3 >10% 300
> DEFAULT C3 NULL 0
> DEFAULT C3 MISSING 0
> DEFAULT C4 <5000 10
> DEFAULT C4 >5000 20
> DEFAULT C4 >10000 30
> DEFAULT C4 >15000 40
Expected output
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| STUDENT_ID | UG_MAJOR | C1 | C2 | C3 | C4 | C1_SCR | C2_SCR | C3_SCR | TOT_SCR |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| 123 | MATH | A | 8000-10000 | 12% | 9000 | 10 | | | |
| 234 | ALL_OTHER | B | 1500-2000 | 10% | 1500 | 20 | | | |
| 345 | ALL_OTHER | A | 2800-3000 | 8% | 2300 | 10 | | | |
| 456 | ALL_OTHER | A | 8000-10000 | 12% | 3200 | 30 | | | |
| 980 | ALL_OTHER | C | 1000-2500 | 15% | 2700 | 40 | | | |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
The traditional SQL way would be:
select student_id,
UG_MAJOR,
C1,
case
when UG_MAJOR ='MATH' AND C1 IS NULL THEN 0
when UG_MAJOR ='MATH' AND C1 ='MISSING' THEN 0
when UG_MAJOR ='MATH' AND C1 ='A' THEN 10
when UG_MAJOR ='MATH' AND C1 ='B' THEN 20
when UG_MAJOR ='MATH' AND C1 ='C' THEN 30
when UG_MAJOR ='ALL_OTHER' AND C1 IS NULL THEN 0
when UG_MAJOR ='ALL_OTHER' AND C1 ='MISSING' THEN 0
when UG_MAJOR ='ALL_OTHER' AND C1 ='A' THEN 20
when UG_MAJOR ='ALL_OTHER' AND C1 ='B' THEN 30
when UG_MAJOR ='ALL_OTHER' AND C1 ='C' THEN 40
ELSE 'TBD' END AS C1_SCR,
C2,
CASE
WHEN C2 IS NULL THEN 0
WHEN C2 ='Missing' OR C2 = . THEN 0
WHEN C2<=1000 THEN 0
WHEN C2 >1000 AND C2<=7000 THEN 20
WHEN C2 >7000 AND C2<=9500 THEN 30
WHEN C2 >9500 THEN 40
ELSE 'TBD'
END AS C2_SCR
FROM REF_INPUT
GROUP BY 1,2,3,4,5,6
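Is there an elegant way to handle this in R? Thanks, Par
Answer 0 (score: 1)
I think your solution is a simple "join" between the input data table and the reference table on multiple columns, even though the SQL code in the question does not indicate this and instead shows a "hard-coded" reference table.
Using the data.table package, the solution could be the last line of this code (the rest is only needed to recreate the data from the question):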
library(data.table)
# your data
input <- setDT(structure(list(STUDENT_ID = c(123L, 234L, 345L, 456L, 980L),
UG_MAJOR = c("MATH", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER"),
C1 = c("A", "B", "A", "A", "C"),
C2 = c("8000-10000", "1500-2000", "2800-3000", "8000-10000", "1000-2500"),
C3 = c("12%", "10%", "8%", "12%", "15%"),
C4 = c(9000L, 1500L, 2300L, 3200L, 2700L)),
.Names = c("STUDENT_ID", "UG_MAJOR", "C1", "C2", "C3", "C4"),
class = "data.frame", row.names = c(NA, -5L)))
input
# this is an incomplete list of your reference data (for demo purposes only)
refdata <-
setDT(structure(
list(
UG_MAJOR = c(
"MATH", "MATH", "MATH", "MATH", "MATH",
"ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER"
), REF_COL = c("C1", "C1", "C1", "C1", "C1", "C1", "C1", "C1",
"C1", "C1"), REF_VAL = c("A", "B", "C", "NULL", "MISSING", "A",
"B", "C", "NULL", "MISSING"), REF_SCORE = c(10L, 20L, 30L, 0L,
0L, 20L, 30L, 40L, 10L, 10L)
), .Names = c("UG_MAJOR", "REF_COL",
"REF_VAL", "REF_SCORE"), class = "data.frame", row.names = c(NA,-10L)
))
refdata
# Join your data to the reference data table using multiple join columns and add a new column to input containing the score
input[refdata[REF_COL=="C1",], C1_SCR := REF_SCORE, on=c(UG_MAJOR="UG_MAJOR", C1="REF_VAL") ][]
Result:
STUDENT_ID UG_MAJOR C1 C2 C3 C4 C1_SCR
1: 123 MATH A 8000-10000 12% 9000 10
2: 234 ALL_OTHER B 1500-2000 10% 1500 30
3: 345 ALL_OTHER A 2800-3000 8% 2300 20
4: 456 ALL_OTHER A 8000-10000 12% 3200 20
5: 980 ALL_OTHER C 1000-2500 15% 2700 40
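For readers who are not using data.table, the same multi-column lookup can be sketched with base R's `merge()`. This is only an illustration under stated assumptions: the name `ref_c1` and the trimmed reference rows are made up for the demo, and only the columns needed for the C1 lookup are recreated.

```r
# Base R sketch of the same lookup (no data.table required).
input <- data.frame(
  STUDENT_ID = c(123L, 234L, 345L, 456L, 980L),
  UG_MAJOR   = c("MATH", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER", "ALL_OTHER"),
  C1         = c("A", "B", "A", "A", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical name: the C1 rows of the reference table (A/B/C only, for brevity)
ref_c1 <- data.frame(
  UG_MAJOR  = rep(c("MATH", "ALL_OTHER"), each = 3),
  REF_VAL   = rep(c("A", "B", "C"), times = 2),
  REF_SCORE = c(10L, 20L, 30L, 20L, 30L, 40L),
  stringsAsFactors = FALSE
)

# Left join on (UG_MAJOR, C1) = (UG_MAJOR, REF_VAL);
# all.x = TRUE keeps students that have no match in the reference table
scored <- merge(input, ref_c1,
                by.x = c("UG_MAJOR", "C1"),
                by.y = c("UG_MAJOR", "REF_VAL"),
                all.x = TRUE)

# Rename the looked-up score and restore the original row order
names(scored)[names(scored) == "REF_SCORE"] <- "C1_SCR"
scored <- scored[order(scored$STUDENT_ID), ]
print(scored)
```

Unlike the data.table update join, `merge()` returns a new data frame and moves the join columns to the front, so the column order differs from the result shown above.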
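Open issues:
The expected scores in your question seem to differ from mine (did you really use the reference data to create the expected result?)
Setting failed lookups to 0 (the value "zero") is not implemented (they will be NA, but NA can be replaced with 0 in a second step)
To create the other columns C2_SCR, C3_SCR and C4_SCR you have to apply the same logic (from the last line of code)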
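The second and third points could be sketched roughly as follows. Several things here are assumptions rather than part of the original answer: student 999 and the value "Z" are made up to force a failed lookup, the reference table is trimmed, the names `c4_breaks` and `c4_scores` are hypothetical, and the treatment of boundary values (exactly 5000 scores 10) is a guess, since the reference data does not say which side a boundary belongs to. The range-based column is handled with `findInterval()` instead of an equality join, because rules such as `>5000` cannot be matched on exact values.

```r
library(data.table)

# Small stand-in for the question's data; student 999 is a made-up row
# whose C1 value has no entry in the reference table.
input <- data.table(
  STUDENT_ID = c(123L, 234L, 999L),
  UG_MAJOR   = c("MATH", "ALL_OTHER", "MATH"),
  C1         = c("A", "B", "Z"),
  C4         = c(9000L, 1500L, 12000L)
)
refdata <- data.table(
  UG_MAJOR  = c("MATH", "MATH", "ALL_OTHER", "ALL_OTHER"),
  REF_COL   = "C1",
  REF_VAL   = c("A", "B", "A", "B"),
  REF_SCORE = c(10L, 20L, 20L, 30L)
)

# Same update join as in the answer; the i. prefix makes explicit that
# REF_SCORE is taken from the reference table
input[refdata[REF_COL == "C1"], C1_SCR := i.REF_SCORE,
      on = c(UG_MAJOR = "UG_MAJOR", C1 = "REF_VAL")]

# Second step: failed lookups are NA after the join, so replace them with 0
input[is.na(C1_SCR), C1_SCR := 0L]

# Range-based column C4 (<5000 -> 10, >5000 -> 20, >10000 -> 30, >15000 -> 40).
# Assumption: left.open = TRUE puts a boundary value into the lower band.
c4_breaks <- c(5000, 10000, 15000)
c4_scores <- c(10L, 20L, 30L, 40L)
input[, C4_SCR := c4_scores[findInterval(C4, c4_breaks, left.open = TRUE) + 1L]]

print(input)
```

C2 ("8000-10000") and C3 ("12%") are stored as text in the question, so they would first have to be converted to numbers before a range lookup like the one for C4 can be applied; how to interpret a range such as "8000-10000" is left open in the question.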