Spark: How to explode data and add column names in PySpark or Scala Spark?

Date: 2018-02-12 14:28:03

Tags: apache-spark apache-spark-sql spark-dataframe apache-spark-dataset

Spark: I want to explode multiple columns and merge them into a single column, with each value labelled by the column it came from.

Input data: 
    +-----------+-----------+-----------+
    |   ASMT_ID |   WORKER  |   LABOR   |
    +-----------+-----------+-----------+
    |   1       |   A1,A2,A3|   B1,B2   |
    +-----------+-----------+-----------+
    |   2       |   A1,A4   |   B1      |
    +-----------+-----------+-----------+

Expected Output:


+-----------+-----------+-----------+
|   ASMT_ID |WRK_CODE   |WRK_DETL   |
+-----------+-----------+-----------+
|   1       |   A1      |   WORKER  |
+-----------+-----------+-----------+
|   1       |   A2      |   WORKER  |
+-----------+-----------+-----------+
|   1       |   A3      |   WORKER  |
+-----------+-----------+-----------+
|   1       |   B1      |   LABOR   |
+-----------+-----------+-----------+
|   1       |   B2      |   LABOR   |
+-----------+-----------+-----------+
|   2       |   A1      |   WORKER  |
+-----------+-----------+-----------+
|   2       |   A4      |   WORKER  |
+-----------+-----------+-----------+
|   2       |   B1      |   LABOR   |
+-----------+-----------+-----------+

2 Answers:

Answer 0 (score: 1):

Probably not the cleanest approach, but it only takes a couple of explodes and a unionAll:

import org.apache.spark.sql.functions._
import spark.implicits._   // needed for the $"col" syntax below

df1.show
+-------+--------+-----+
|ASMT_ID|  WORKER|LABOR|
+-------+--------+-----+
|      1|A1,A2,A3|B1,B2|
|      2|   A1,A4|   B1|
+-------+--------+-----+

df1.cache   // df1 is read twice below, once per exploded column

// Split the comma-separated WORKER string, emit one row per code,
// and tag each row with its source column name
val workers = df1.drop("LABOR")
                 .withColumn("WRK_CODE", explode(split($"WORKER", ",")))
                 .withColumn("WRK_DETL", lit("WORKER"))
                 .drop("WORKER")

// Same treatment for LABOR
val labors = df1.drop("WORKER")
                .withColumn("WRK_CODE", explode(split($"LABOR", ",")))
                .withColumn("WRK_DETL", lit("LABOR"))
                .drop("LABOR")

// unionAll is the Spark 1.x name; it is deprecated in favour of union on 2.x+
workers.unionAll(labors).orderBy($"ASMT_ID".asc, $"WRK_CODE".asc).show

+-------+--------+--------+
|ASMT_ID|WRK_CODE|WRK_DETL|
+-------+--------+--------+
|      1|      A1|  WORKER|
|      1|      A2|  WORKER|
|      1|      A3|  WORKER|
|      1|      B1|   LABOR|
|      1|      B2|   LABOR|
|      2|      A1|  WORKER|
|      2|      A4|  WORKER|
|      2|      B1|   LABOR|
+-------+--------+--------+
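
If more columns need the same treatment, repeating a block per column gets tedious. Below is a minimal sketch of one way to generalize it, not taken from the answer itself: map over a list of column names, build one exploded DataFrame per column, and reduce with union (the Spark 2.x+ replacement for unionAll). It assumes every listed column holds comma-separated strings.

// Hypothetical generalization: one exploded DataFrame per source column,
// each tagged with its column name, then reduced into a single result
val sources = Seq("WORKER", "LABOR")

val combined = sources.map { c =>
  df1.select(
    $"ASMT_ID",
    explode(split(col(c), ",")).as("WRK_CODE"),   // one row per code
    lit(c).as("WRK_DETL"))                        // label with source column
}.reduce(_ union _)

combined.orderBy($"ASMT_ID".asc, $"WRK_CODE".asc).show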

Answer 1 (score: -1):

Another solution:

from pyspark.sql.functions import explode, lit

# Sample input: here WORKER and LABOR are already array columns
df = spark.createDataFrame([
    ("1", ["A1", "A2", "A3"], ["B1", "B2"]),
    ("2", ["A1", "A4"], ["B1"])],
    ['ASMT_ID', 'WORKER', 'LABOR'])

# Explode each array column separately, tag the rows with the source
# column name, then union the two results
df.select('ASMT_ID', explode('WORKER').alias('WRK_CODE'), lit('WORKER').alias('WRK_DETL'))\
    .unionAll(df.select('ASMT_ID', explode('LABOR').alias('WRK_CODE'), lit('LABOR').alias('WRK_DETL')))\
    .orderBy(['ASMT_ID', 'WRK_CODE']).show()
+-------+--------+--------+
|ASMT_ID|WRK_CODE|WRK_DETL|
+-------+--------+--------+
|      1|      A1|  WORKER|
|      1|      A2|  WORKER|
|      1|      A3|  WORKER|
|      1|      B1|   LABOR|
|      1|      B2|   LABOR|
|      2|      A1|  WORKER|
|      2|      A4|  WORKER|
|      2|      B1|   LABOR|
+-------+--------+--------+
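
A footnote to both answers: the union of per-column explodes builds one plan branch per source column, whereas SQL's stack() can unpivot the columns in a single pass. A hedged sketch against the first answer's df1 (comma-separated strings), not taken from either answer:

// stack(2, ...) emits one (WRK_DETL, codes) row per source column,
// then a single explode splits the comma-separated codes
val stacked = df1
  .select($"ASMT_ID",
          expr("stack(2, 'WORKER', WORKER, 'LABOR', LABOR) as (WRK_DETL, codes)"))
  .withColumn("WRK_CODE", explode(split($"codes", ",")))
  .select("ASMT_ID", "WRK_CODE", "WRK_DETL")

stacked.orderBy($"ASMT_ID".asc, $"WRK_CODE".asc).show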