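I want to use PySpark SQL to assign labels to the categorical codes in the dataframe below.
In the MARRIAGE column, 1 = Married and 2 = UnMarried. In the EDUCATION column, 1 = Grad and 2 = UnderGrad.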
Current Dataframe:

    +--------+---------+-----+
    |MARRIAGE|EDUCATION|Total|
    +--------+---------+-----+
    |       1|        2|   87|
    |       1|        1|  123|
    |       2|        2|    3|
    |       2|        1|    8|
    +--------+---------+-----+
Resulting Dataframe:

    +---------+---------+-----+
    |MARRIAGE |EDUCATION|Total|
    +---------+---------+-----+
    |Married  |UnderGrad|   87|
    |Married  |Grad     |  123|
    |UnMarried|UnderGrad|    3|
    |UnMarried|Grad     |    8|
    +---------+---------+-----+
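Is it possible to assign the labels using a single UDF and withColumn()? Is there a way to do it in one UDF by passing in the whole dataframe, while keeping the column names unchanged?
I can think of a solution that uses a separate UDF for each column, as shown below, but I can't figure out whether there is a way to do both together.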
    from pyspark.sql import functions as F

    def assign_marital_names(record):
        if record == 1:
            return "Married"
        elif record == 2:
            return "UnMarried"

    def assign_edu_names(record):
        if record == 1:
            return "Grad"
        elif record == 2:
            return "UnderGrad"

    assign_marital_udf = F.udf(assign_marital_names)
    assign_edu_udf = F.udf(assign_edu_names)

    df.withColumn("MARRIAGE", assign_marital_udf("MARRIAGE")).\
        withColumn("EDUCATION", assign_edu_udf("EDUCATION")).show(truncate=False)
Answer 0 (score: 0)
A UDF can only produce a single column. That column can be a struct, though, so one UDF can apply labels to both MARRIAGE and EDUCATION. See the following code:

    from pyspark.sql import functions as F  # same alias as in the question
    from pyspark.sql.types import *
    from pyspark.sql import Row

    # Return type of the UDF: a struct with one string field per labelled column
    udf_result = StructType([StructField('MARRIAGE', StringType()), StructField('EDUCATION', StringType())])

    marriage_dict = {1: 'Married', 2: 'UnMarried'}
    education_dict = {1: 'Grad', 2: 'UnderGrad'}

    def assign_labels(marriage, education):
        return Row(marriage_dict[marriage], education_dict[education])

    assign_labels_udf = F.udf(assign_labels, udf_result)

    df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION')).printSchema()

    root
     |-- MARRIAGE: long (nullable = true)
     |-- EDUCATION: long (nullable = true)
     |-- Total: long (nullable = true)
     |-- labels: struct (nullable = true)
     |    |-- MARRIAGE: string (nullable = true)
     |    |-- EDUCATION: string (nullable = true)

But as you can see, this does not replace the original columns; it only adds a new one. To replace them, you need to use withColumn twice and then drop the labels column.
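The replacement step could look roughly like the sketch below: apply the struct UDF once, copy each struct field back over the original column with withColumn, then drop the temporary column. The helper name `relabel` is hypothetical, and `dict.get` is used instead of direct indexing so that unknown codes become null rather than raising an error (an assumption, not part of the answer above). The PySpark imports are placed inside the function only so that the plain-Python mapping can be checked without a Spark session.

```python
marriage_dict = {1: "Married", 2: "UnMarried"}
education_dict = {1: "Grad", 2: "UnderGrad"}


def assign_labels(marriage, education):
    # Plain-Python lookup; codes missing from the dicts map to None (null in Spark).
    return (marriage_dict.get(marriage), education_dict.get(education))


def relabel(df):
    # Hypothetical helper: expects a DataFrame with MARRIAGE and EDUCATION columns.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([
        StructField("MARRIAGE", StringType()),
        StructField("EDUCATION", StringType()),
    ])
    labels_udf = F.udf(assign_labels, schema)

    return (
        df.withColumn("labels", labels_udf("MARRIAGE", "EDUCATION"))
        # Overwrite each original column with the matching struct field.
        .withColumn("MARRIAGE", F.col("labels.MARRIAGE"))
        .withColumn("EDUCATION", F.col("labels.EDUCATION"))
        # Remove the temporary struct column.
        .drop("labels")
    )
```

With the question's dataframe, `relabel(df).show(truncate=False)` should then display MARRIAGE and EDUCATION as strings, with Total unchanged.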