PySpark SQL: using a CASE statement

Asked: 2018-05-14 12:18:14

Tags: sql apache-spark pyspark apache-spark-sql pyspark-sql

I have a DataFrame that looks like this

>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows

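The high_income column is binary and holds either 0 or 1. aml_cluster_id holds values from 0 to 3. I want to create a new column whose value depends on the values of high_income and aml_cluster_id in that particular row. I am trying to do this with SQL.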

df_w_cluster.createTempView('event_rate_holder')

To do that, I wrote a query like this:

q = """select * , case 
 when "aml_cluster_id" = 0 and  "high_income" = 1 then "high_income_encoded" = 0.162 else 
 when "aml_cluster_id" = 0 and  "high_income" = 0 then "high_income_encoded" = 0.337 else 
 when "aml_cluster_id" = 1 and  "high_income" = 1 then "high_income_encoded" = 0.049 else 
 when "aml_cluster_id" = 1 and  "high_income" = 0 then "high_income_encoded" = 0.402 else 
 when "aml_cluster_id" = 2 and  "high_income" = 1 then "high_income_encoded" = 0.005 else 
 when "aml_cluster_id" = 2 and  "high_income" = 0 then "high_income_encoded" = 0.0 else 
 when "aml_cluster_id" = 3 and  "high_income" = 1 then "high_income_encoded" = 0.023 else 
 when "aml_cluster_id" = 3 and  "high_income" = 0 then "high_income_encoded" = 0.022 else 
 from event_rate_holder"""

When I run it in Spark using

spark.sql(q)

I get the following error

mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)

Any idea how to get past this?

Edit

I edited the query following the suggestions in the comments below:

q = """select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
when aml_cluster_id = 0 and  high_income = 0 then high_income_encoded = 0.337 else 
when aml_cluster_id = 1 and  high_income = 1 then high_income_encoded = 0.049 else 
when aml_cluster_id = 1 and  high_income = 0 then high_income_encoded = 0.402 else 
when aml_cluster_id = 2 and  high_income = 1 then high_income_encoded = 0.005 else 
when aml_cluster_id = 2 and  high_income = 0 then high_income_encoded = 0.0 else 
when aml_cluster_id = 3 and  high_income = 1 then high_income_encoded = 0.023 else 
when aml_cluster_id = 3 and  high_income = 0 then high_income_encoded = 0.022 end
from event_rate_holder"""

But I still get an error

== SQL ==
select * , case 
when aml_cluster_id = 0 and  high_income = 1 then high_income_encoded = 0.162 else 
-----^^^

followed by

pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,

2 Answers:

Answer 0 (score: 3)

The correct syntax for the CASE variant you are using is

CASE  
   WHEN e1 THEN e2 [ ...n ]   
   [ ELSE else_result_expression ]   
END  

So

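  • THEN should be followed by an expression. There is no place for name = something there.
  • ELSE is allowed once per CASE, not after every WHEN.
  • Your original code is missing the closing END.
  • Finally, columns should not be quoted.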

You probably meant

CASE 
  WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
  WHEN aml_cluster_id = 0 and  high_income = 0 THEN  0.337
  ...
END AS high_income_encoded 
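
As an illustration of the corrected shape, here is a minimal Python sketch that builds the full query string from the eight values listed in the question. The ENCODING dictionary and the build_case_query helper are hypothetical names introduced only for this example. The view name event_rate_holder and the column names come from the question.

```python
# Encoded value for each (aml_cluster_id, high_income) pair, copied from the question.
ENCODING = {
    (0, 1): 0.162,
    (0, 0): 0.337,
    (1, 1): 0.049,
    (1, 0): 0.402,
    (2, 1): 0.005,
    (2, 0): 0.0,
    (3, 1): 0.023,
    (3, 0): 0.022,
}


def build_case_query(encoding, view="event_rate_holder", alias="high_income_encoded"):
    """Return a SELECT with a single CASE ... END expression aliased as `alias`.

    Hypothetical helper: one WHEN ... THEN <value> branch per dictionary entry,
    no ELSE after each WHEN, unquoted column names, and one closing END.
    """
    branches = "\n".join(
        f"  WHEN aml_cluster_id = {cluster} AND high_income = {income} THEN {value}"
        for (cluster, income), value in encoding.items()
    )
    return f"SELECT *, CASE\n{branches}\nEND AS {alias}\nFROM {view}"


q = build_case_query(ENCODING)
print(q)
```

Assuming the temp view from the question is registered, the resulting string could be passed to spark.sql(q). Because the CASE has no ELSE branch, a row matching none of the conditions would get NULL in high_income_encoded.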

Answer 1 (score: 0)

Each WHEN condition in the query needs its CASE to be closed with END. The column names need backticks, and the `high_income_encoded` column name should be aliased at the end. So the correct query is as follows

) and