我是pandas的新手并从sql迁移。 我有一个问题,我正在尝试用pandas替换sql-case语句 在高级别,我有一个输入数据框和一个参考表。我根据ref创建计算列。表 例 输入数据 ------------ ----------- + + + ---- ------------ + ----- + - ----- + |
STUDENT_ID | UG_MAJOR | C1 | C2 | C3 | C4 |
+------------+-----------+----+------------+-----+------+
| 123 | MATH | A | 8000-10000 | 12% | 9000 |
| 234 | ALL_OTHER | B | 1500-2000 | 10% | 1500 |
| 345 | ALL_OTHER | A | 2800-3000 | 8% | 2300 |
| 456 | ALL_OTHER | A | 8000-10000 | 12% | 3200 |
| 980 | ALL_OTHER | C | 1000-2500 | 15% | 2700 |
+------------+-----------+----+------------+-----+------+
参考数据
---------+---------+---------+
| REF_COL | REF_VAL | REF_SCR |
+---------+---------+---------+
| C1 | A | 10 |
| C1 | B | 20 |
| C1 | C | 30 |
| C1 | NULL | 0 |
| C1 | MISSING | 0 |
| C1 | A | 20 |
| C1 | B | 30 |
| C1 | C | 40 |
| C1 | NULL | 10 |
| C1 | MISSING | 10 |
| C2 | <1000 | 0 |
| C2 | >1000 | 20 |
| C2 | >7000 | 30 |
| C2 | >9500 | 40 |
| C2 | MISSING | 0 |
| C2 | NULL | 0 |
| C3 | <3% | 5 |
| C3 | >3% | 10 |
| C3 | >5% | 100 |
| C3 | >7% | 200 |
| C3 | >10% | 300 |
| C3 | NULL | 0 |
| C3 | MISSING | 0 |
| C4 | <5000 | 10 |
| C4 | >5000 | 20 |
| C4 | >10000 | 30 |
| C4 | >15000 | 40 |
+---------+---------+---------+
----------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| Req.Output | | | | | | | | | |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
| STUDENT_ID | UG_MAJOR | C1 | C2 | C3 | C4 | C1_SCR | C2_SCR | C3_SCR | TOT_SCR |
| 123 | MATH | A | 8000-10000 | 12% | 9000 | | | | |
| 234 | ALL_OTHER | B | 1500-2000 | 10% | 1500 | | | | |
| 345 | ALL_OTHER | A | 2800-3000 | 8% | 2300 | | | | |
| 456 | ALL_OTHER | A | 8000-10000 | 12% | 3200 | | | | |
| 980 | ALL_OTHER | C | 1000-2500 | 15% | 2700 | | | | |
+------------+-----------+----+------------+-----+------+--------+--------+--------+---------+
传统的SQL方式是
select student_id,
UG_MAJOR,
C1,
case
when UG_MAJOR ='MATH' AND when C1 IS NULL THEN 0
when UG_MAJOR ='MATH' AND when C1 ='MISSING' THEN 0
when UG_MAJOR ='MATH' AND when C1 ='A' THEN 10
when UG_MAJOR ='MATH' AND when C1 ='B' THEN 20
when UG_MAJOR ='MATH' AND when C1 ='C' THEN 30
when UG_MAJOR ='ALL_OTHER' AND when C1 IS NULL THEN 0
when UG_MAJOR ='ALL_OTHER' AND when C1 ='MISSING' THEN 0
when UG_MAJOR ='ALL_OTHER' AND when C1 ='A' THEN 20
when UG_MAJOR ='ALL_OTHER' AND when C1 ='B' THEN 30
when UG_MAJOR ='ALL_OTHER' AND when C1 ='C' THEN 40
ELSE 'TBD' END AS C1_SCR,
C2,
CASE
WHEN C2 IS NULL THEN 0
WHEN C2 ='Missing' OR C2 = . THEN 0
WHEN C2<=1000 THEN 0
WHEN C2 >1000 AND C2<=7000 THEN 20
WHEN C2 >7000 AND C2<=9500 THEN 30
WHEN C2 >9500 THEN 40
ELSE 'TBD'
END AS C2_SCR
FROM REF_INPUT
GROUP BY 1,2,3,4,5,6
我想知道是否有一种优雅的方式来处理熊猫? 谢谢 帕
答案 0 :(得分:0)
正如上面的一些评论中所述,由于未提供此解决方案,因此我做了一些假设,但这是我尝试返回所请求的df的尝试,即使它不是很优雅...
dfc = df.copy()
dfc['c1_scr'] = 'TBD'
dfc = dfc.loc[((dfc.ug_major=='MATH')&(dfc.c1.isnull()))
|((dfc.ug_major=='MATH')&(dfc.c1=='Missing'))
|((dfc.ug_major=='ALL_OTHER')&(dfc.c1=='Missing'))
|((dfc.ug_major=='MATH')&(dfc.c1.isnull())),
'c1_scr'] = 0
dfc = dfc.loc[((dfc.ug_major=='MATH')&(dfc.c1=='A')),'c1_scr'] = 10
dfc = dfc.loc[((dfc.ug_major=='MATH')&(dfc.c1=='B'))
|((dfc.ug_major=='ALL_OTHER')&(dfc.c1=='A'))
,'c1_scr'] = 20
dfc = dfc.loc[((dfc.ug_major=='MATH')&(dfc.c1=='C'))
|((dfc.ug_major=='ALL_OTHER')&(dfc.c1=='A'))
,'c1_scr'] = 30
dfc = dfc.loc[((dfc.ug_major=='ALL_OTHER')&(dfc.c1=='C')),'c1_scr'] = 40
dfc['c2_scr'] = 'TBD'
dfc = dfc.loc[(dfc.c2.isnull())
|(dfc.c2=='MISSING')
|(dfc.c2=='.')
|(dfc.c2<=1000)
,'c2_scr'] = 0
dfc = dfc.loc[(dfc.c2>1000)
&(dfc.c2<=7000)
,'c2_scr'] = 20
dfc = dfc.loc[(dfc.c2>7000)
&(dfc.c2<=9500)
,'c2_scr'] = 30
dfc = dfc.loc[(dfc.c2>9500),'c2_scr'] = 40
dfc = dfc[['student_id','ug_major','c1','c1_scr'
,'c2','c2_scr']
].groupby(['student_id','ug_major','c1','c1_scr','c2','c2_scr'])
print(dfc.head())