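Assigning labels to categorical data in a table in PySpark

Asked: 2016-11-27 00:49:36

Tags: pyspark

I want to use PySpark SQL to assign labels to the categorical codes in the DataFrame below.

In the MARRIAGE column, 1 = Married and 2 = UnMarried. In the EDUCATION column, 1 = Grad and 2 = UnderGrad.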

Current Dataframe:
+--------+---------+-----+
|MARRIAGE|EDUCATION|Total|
+--------+---------+-----+
|       1|        2|   87|
|       1|        1|  123|
|       2|        2|    3|
|       2|        1|    8|
+--------+---------+-----+
Resulting Dataframe:
+---------+---------+-----+
|MARRIAGE |EDUCATION|Total|
+---------+---------+-----+
|Married  |UnderGrad|   87|
|Married  |Grad     |  123|
|UnMarried|UnderGrad|    3|
|UnMarried|Grad     |    8|
+---------+---------+-----+

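Is it possible to assign the labels using a single UDF and withColumn()? Is there a way to do the assignment in a single UDF by passing in the whole DataFrame, while keeping the column names unchanged?

I can think of a solution that uses a separate UDF for each column, as shown below, but I can't figure out whether there is a way to do both together.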

from pyspark.sql import functions as F

def assign_marital_names(record):
    if record == 1:
        return "Married"
    elif record == 2:
        return "UnMarried"


def assign_edu_names(record):
    if record == 1:
        return "Grad"
    elif record == 2:
        return "UnderGrad"

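# Wrap the plain Python functions as UDFs (the default return type is StringType).
assign_marital_udf = F.udf(assign_marital_names)
assign_edu_udf = F.udf(assign_edu_names)
# Overwrite each column in place with its label.
(df.withColumn("MARRIAGE", assign_marital_udf("MARRIAGE"))
   .withColumn("EDUCATION", assign_edu_udf("EDUCATION"))
   .show(truncate=False))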

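1 Answer:

Answer 0 (score: 0)

A UDF can only produce a single column. That column can be a struct, however, so one UDF can apply the labels to both MARRIAGE and EDUCATION. See the following code: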

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql import Row

# Schema of the struct returned by the UDF: one string field per labelled column.
udf_result = StructType([StructField('MARRIAGE', StringType()), StructField('EDUCATION', StringType())])

marriage_dict = {1: 'Married', 2: 'UnMarried'}
education_dict = {1: 'Grad', 2: 'UnderGrad'}

def assign_labels(marriage, education):
    return Row(marriage_dict[marriage], education_dict[education])

assign_labels_udf = F.udf(assign_labels, udf_result)
df.withColumn('labels', assign_labels_udf('MARRIAGE', 'EDUCATION')).printSchema()

root
 |-- MARRIAGE: long (nullable = true)
 |-- EDUCATION: long (nullable = true)
 |-- Total: long (nullable = true)
 |-- labels: struct (nullable = true)
 |    |-- MARRIAGE: string (nullable = true)
 |    |-- EDUCATION: string (nullable = true)

But as you can see, this does not replace the original columns; it only adds a new one. To replace them, you need to call withColumn twice and then drop the labels column.
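
The replacement step mentioned in this answer (two further withColumn calls followed by a drop) is not spelled out, so here is a minimal sketch of how it might look. It assumes a DataFrame with the MARRIAGE, EDUCATION and Total columns from the question. The helper name `replace_with_labels` is hypothetical, and using `dict.get` so that unknown codes become null instead of raising a KeyError is a choice of this sketch, not part of the original answer. The PySpark imports are placed inside the helper only so that the plain-Python mapping can be exercised without a Spark installation.

```python
# Code-to-label mappings taken from the question.
MARRIAGE_LABELS = {1: "Married", 2: "UnMarried"}
EDUCATION_LABELS = {1: "Grad", 2: "UnderGrad"}


def assign_labels(marriage, education):
    """Return (marriage_label, education_label); unknown codes map to None."""
    return (MARRIAGE_LABELS.get(marriage), EDUCATION_LABELS.get(education))


def replace_with_labels(df):
    """Hypothetical helper: overwrite MARRIAGE and EDUCATION via one struct UDF."""
    # Imported lazily so the mapping above stays usable without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StructType, StructField, StringType

    # The UDF returns a struct with one string field per labelled column.
    schema = StructType([
        StructField("MARRIAGE", StringType()),
        StructField("EDUCATION", StringType()),
    ])
    labels_udf = F.udf(assign_labels, schema)

    return (
        df.withColumn("labels", labels_udf("MARRIAGE", "EDUCATION"))
          # Copy each struct field over the original column of the same name.
          .withColumn("MARRIAGE", F.col("labels.MARRIAGE"))
          .withColumn("EDUCATION", F.col("labels.EDUCATION"))
          # The temporary struct column is no longer needed.
          .drop("labels")
    )
```

Calling `replace_with_labels(df).show(truncate=False)` should then display the same three column names as the input, with the codes replaced by their labels.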