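I am trying to figure out how to dynamically create a column for each item in a list (here, the cp_codeset list) by calling a udf inside withColumn() in PySpark. Below is the code I wrote, but it gives me an error.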
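The alternative is to do it by hand, but then I would have to write the same udf and call it with withColumn() 75 times (the size of cp_codeset["col_names"]).
Below are my two data frames; I am trying to get a sense of what the result should look like.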
from pyspark.sql.functions import udf, col, lit
from pyspark.sql import Row
from pyspark.sql.types import IntegerType

# cp_codeset is a pandas DataFrame (codes, col_names); p is a Spark DataFrame (id, codes).
codeset = set(cp_codeset['codes'])

for col_name in cp_codeset.col_names.unique():
    def flag(d):
        if d in codeset:
            # Note: name is a pandas Series here, not a single string.
            name = cp_codeset[cp_codeset['codes'] == d].col_names
            if name == col_name:
                return 1
            else:
                return 0
    cpf_udf = udf(flag, IntegerType())
    p.withColumn(col_name, cpf_udf(p.codes)).show()
p:
id|codes
1|100
2|102
3|104

cp_codeset:
codes|col_names
100|a
101|b
102|c
103|d
104|e
105|f
Answer 0 (score: 2)
First filter the data down to the codes that appear in p:
cp_codeset.set_index('codes').loc[p.codes]
Out[44]:
      col_names
codes
100           a
102           c
104           e
Then just use get_dummies:
pd.get_dummies(cp_codeset.set_index('codes').loc[p.codes])
Out[45]:
       col_names_a  col_names_c  col_names_e
codes
100              1            0            0
102              0            1            0
104              0            0            1
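
For anyone who wants to try these two steps, here is a minimal, self-contained sketch. It assumes p and cp_codeset are both pandas DataFrames rebuilt from the sample data in the question (the question's p is a Spark DataFrame, so it would have to be converted first, for example with toPandas()). The dtype=int argument is an addition of mine: recent pandas versions return boolean columns from get_dummies by default, and this keeps the 0/1 values shown above.

```python
import pandas as pd

# Sample data from the question, rebuilt as pandas DataFrames (an assumption).
p = pd.DataFrame({"id": [1, 2, 3], "codes": [100, 102, 104]})
cp_codeset = pd.DataFrame({
    "codes": [100, 101, 102, 103, 104, 105],
    "col_names": ["a", "b", "c", "d", "e", "f"],
})

# Keep only the rows of cp_codeset whose code appears in p.
filtered = cp_codeset.set_index("codes").loc[p["codes"]]

# One indicator column per remaining col_names value.
flags = pd.get_dummies(filtered, dtype=int)
print(flags)
```

Only names that actually occur in p (a, c, e) become columns; the other codes in cp_codeset are dropped by the filtering step.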
Answer 1 (score: 2)
I would use get_dummies together with join + map:
m = cp_codeset.set_index('codes').col_names
p.join(pd.get_dummies(p.codes.map(m)))
   id  codes  a  c  e
0   1    100  1  0  0
1   2    102  0  1  0
2   3    104  0  0  1
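
The same idea as a runnable sketch, again assuming pandas DataFrames rebuilt from the question's sample data rather than a Spark DataFrame. As before, dtype=int is my addition to keep 0/1 values on recent pandas versions, and the name result is only for illustration.

```python
import pandas as pd

# Sample data from the question, rebuilt as pandas DataFrames (an assumption).
p = pd.DataFrame({"id": [1, 2, 3], "codes": [100, 102, 104]})
cp_codeset = pd.DataFrame({
    "codes": [100, 101, 102, 103, 104, 105],
    "col_names": ["a", "b", "c", "d", "e", "f"],
})

# Series that maps each code to its column name.
m = cp_codeset.set_index("codes")["col_names"]

# Translate codes to names, one-hot encode them, and attach the flags to p by row index.
result = p.join(pd.get_dummies(p["codes"].map(m), dtype=int))
print(result)
```

Compared with the first answer, this keeps id and codes next to the flag columns, and the flag columns carry the bare names (a, c, e) without a col_names_ prefix.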