I have a DataFrame that looks like this:
>>> df_w_cluster.select('high_income', 'aml_cluster_id').show(10)
+-----------+--------------+
|high_income|aml_cluster_id|
+-----------+--------------+
|          0|             0|
|          0|             0|
|          0|             1|
|          0|             1|
|          0|             0|
|          0|             0|
|          0|             1|
|          1|             1|
|          1|             0|
|          1|             0|
+-----------+--------------+
only showing top 10 rows
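The `high_income` column is a binary column containing 0 or 1. `aml_cluster_id` holds values from 0 to 3. I want to create a new column whose value depends on the values of `high_income` and `aml_cluster_id` in that particular row. I am trying to do this with SQL.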
df_w_cluster.createTempView('event_rate_holder')
To do this, I wrote a query like this:
q = """select * , case
when "aml_cluster_id" = 0 and "high_income" = 1 then "high_income_encoded" = 0.162 else
when "aml_cluster_id" = 0 and "high_income" = 0 then "high_income_encoded" = 0.337 else
when "aml_cluster_id" = 1 and "high_income" = 1 then "high_income_encoded" = 0.049 else
when "aml_cluster_id" = 1 and "high_income" = 0 then "high_income_encoded" = 0.402 else
when "aml_cluster_id" = 2 and "high_income" = 1 then "high_income_encoded" = 0.005 else
when "aml_cluster_id" = 2 and "high_income" = 0 then "high_income_encoded" = 0.0 else
when "aml_cluster_id" = 3 and "high_income" = 1 then "high_income_encoded" = 0.023 else
when "aml_cluster_id" = 3 and "high_income" = 0 then "high_income_encoded" = 0.022 else
from event_rate_holder"""
When I run it in Spark using `spark.sql(q)`, I get the following error:
mismatched input 'aml_cluster_id' expecting <EOF>(line 1, pos 22)
Any idea how to get past this?
Edit:
I edited the query based on the suggestions in the comments below:
q = """select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
when aml_cluster_id = 0 and high_income = 0 then high_income_encoded = 0.337 else
when aml_cluster_id = 1 and high_income = 1 then high_income_encoded = 0.049 else
when aml_cluster_id = 1 and high_income = 0 then high_income_encoded = 0.402 else
when aml_cluster_id = 2 and high_income = 1 then high_income_encoded = 0.005 else
when aml_cluster_id = 2 and high_income = 0 then high_income_encoded = 0.0 else
when aml_cluster_id = 3 and high_income = 1 then high_income_encoded = 0.023 else
when aml_cluster_id = 3 and high_income = 0 then high_income_encoded = 0.022 end
from event_rate_holder"""
But I still get an error:
== SQL ==
select * , case
when aml_cluster_id = 0 and high_income = 1 then high_income_encoded = 0.162 else
-----^^^
followed by
pyspark.sql.utils.ParseException: "\nmismatched input 'aml_cluster_id' expecting <EOF>(line 2, pos 5)\n\n== SQL ==\nselect * ,
Answer 0 (score: 3)
The correct syntax for the variant of `CASE` you are using is
CASE
WHEN e1 THEN e2 [ ...n ]
[ ELSE else_result_expression ]
END
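So there is no place for `name = something` in the `THEN` part, `ELSE` is allowed once per `CASE`, not after every `WHEN`, and the expression has to be closed with `END`.
You probably meant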
CASE
WHEN aml_cluster_id = 0 AND high_income = 1 THEN 0.162
WHEN aml_cluster_id = 0 and high_income = 0 THEN 0.337
...
END AS high_income_encoded
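For reference, here is a sketch of the full query with all eight branches filled in, using the values from the question. Running it against an in-memory SQLite table is only a stand-in so the example works without a Spark session (an assumption of this sketch, not something from the question). The same string should be usable with `spark.sql(q)` once `event_rate_holder` is registered as a view.

```python
import sqlite3

# Corrected query: each THEN yields a plain value, there is no ELSE after
# every WHEN, and the alias is attached once, after END.
q = """select *, case
    when aml_cluster_id = 0 and high_income = 1 then 0.162
    when aml_cluster_id = 0 and high_income = 0 then 0.337
    when aml_cluster_id = 1 and high_income = 1 then 0.049
    when aml_cluster_id = 1 and high_income = 0 then 0.402
    when aml_cluster_id = 2 and high_income = 1 then 0.005
    when aml_cluster_id = 2 and high_income = 0 then 0.0
    when aml_cluster_id = 3 and high_income = 1 then 0.023
    when aml_cluster_id = 3 and high_income = 0 then 0.022
    end as high_income_encoded
from event_rate_holder"""

# Stand-in for the Spark temp view: a small SQLite table with the same columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table event_rate_holder (high_income integer, aml_cluster_id integer)"
)
conn.executemany(
    "insert into event_rate_holder values (?, ?)",
    [(0, 0), (0, 1), (1, 1), (1, 0)],
)

# Each result row is (high_income, aml_cluster_id, high_income_encoded).
rows = conn.execute(q).fetchall()
for row in rows:
    print(row)
```

Because no `ELSE` is given, any row that matches none of the branches gets `NULL` in `high_income_encoded`.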
Answer 1 (score: 0)
The `WHEN` conditions in the query need to be closed by the `CASE`'s `END`, and the name of the returned column (`high_income_encoded`) should be given as an alias at the end, after `END`.
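One way to avoid writing the branches by hand is to generate the expression from a lookup table, which makes it harder to misplace the `END` or the alias. This is only a sketch: the helper `build_case_expr` and the `rates` dict are hypothetical names introduced for illustration, and the values are the ones from the question.

```python
# (aml_cluster_id, high_income) -> encoded value, copied from the question.
rates = {
    (0, 1): 0.162, (0, 0): 0.337,
    (1, 1): 0.049, (1, 0): 0.402,
    (2, 1): 0.005, (2, 0): 0.0,
    (3, 1): 0.023, (3, 0): 0.022,
}


def build_case_expr(rates, alias):
    """Return a CASE expression with one WHEN per key, closed by a single END."""
    branches = [
        f"when aml_cluster_id = {cluster} and high_income = {income} then {value}"
        for (cluster, income), value in rates.items()
    ]
    # The alias is attached once, after END, rather than inside each THEN.
    return "case\n    " + "\n    ".join(branches) + f"\nend as {alias}"


q = (
    f"select *, {build_case_expr(rates, 'high_income_encoded')}\n"
    "from event_rate_holder"
)
print(q)
```

The resulting string can then be passed to `spark.sql(q)` as in the question.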