Transpose a dataframe in pyspark

Time: 2019-01-08 14:54:38

Tags: pyspark

I have a dataframe that looks like this:

ID, Code_Num, Code,              Code1,  Code2,  Code3
10, 1,        A1005*B1003,       A1005,  B1003,  null
12, 2,        A1007*D1008*C1004, A1007,  D1008,  C1004

I need help transposing the above dataset; the output should look like this:

ID, Code_Num, Code,              Code_T
10, 1,        A1005*B1003,       A1005
10, 1,        A1005*B1003,       B1003
12, 2,        A1007*D1008*C1004, A1007
12, 2,        A1007*D1008*C1004, D1008
12, 2,        A1007*D1008*C1004, C1004

1 Answer:

Answer 0 (score: 0)

Step 1: Create the DataFrame

# assumes a pyspark shell where sqlContext already exists;
# on Spark 2.x+, spark.createDataFrame works the same way
values = [(10, 'A1005*B1003', 'A1005', 'B1003', None),
          (12, 'A1007*D1008*C1004', 'A1007', 'D1008', 'C1004')]
df = sqlContext.createDataFrame(values, ['ID', 'Code', 'Code1', 'Code2', 'Code3'])
df.show()
+---+-----------------+-----+-----+-----+
| ID|             Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10|      A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
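
Note that this reproduction drops the Code_Num column from the question. If you also need Code_Num in the output, it can simply be carried along as an extra identifier column; a minimal sketch (df_full is a hypothetical name, kept separate so the steps below still apply to df):

values_full = [(10, 1, 'A1005*B1003', 'A1005', 'B1003', None),
               (12, 2, 'A1007*D1008*C1004', 'A1007', 'D1008', 'C1004')]
# pass it through the identifier list later, e.g.
# to_transpose(df_full, ["ID", "Code_Num", "Code"])
df_full = sqlContext.createDataFrame(
    values_full, ['ID', 'Code_Num', 'Code', 'Code1', 'Code2', 'Code3'])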

Step 2: Explode the DataFrame

from pyspark.sql.functions import array, col, explode, lit, struct

def to_transpose(df, by):
    # Filter dtypes and split into column names and type description
    cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
    # Spark SQL supports only homogeneous columns
    assert len(set(dtypes)) == 1, "All columns have to be of the same type"

    # Create and explode an array of (column_name, column_value) structs
    kvs = explode(array([
      struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
    ])).alias("kvs")

    return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])

df = to_transpose(df, ["ID","Code"]).drop('key').withColumnRenamed("val","Code_T")
df.show()
+---+-----------------+------+
| ID|             Code|Code_T|
+---+-----------------+------+
| 10|      A1005*B1003| A1005|
| 10|      A1005*B1003| B1003|
| 10|      A1005*B1003|  null|
| 12|A1007*D1008*C1004| A1007|
| 12|A1007*D1008*C1004| D1008|
| 12|A1007*D1008*C1004| C1004|
+---+-----------------+------+
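
As an aside, on Spark 3.4 or later the to_transpose helper is no longer necessary: the built-in DataFrame.melt (also exposed as unpivot) performs the same fold. A sketch, assuming the wide step 1 frame was kept around as wide_df, since df above now holds the transposed result:

# Spark 3.4+ only; wide_df stands for the step 1 DataFrame
df_melt = wide_df.melt(
    ids=["ID", "Code"],                  # columns kept as identifiers
    values=["Code1", "Code2", "Code3"],  # columns folded into rows
    variableColumnName="key",
    valueColumnName="Code_T",
).drop("key")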

Note the null Code_T row for ID 10 in the output above: it comes from the empty Code3 slot in the wide table. If you only need the non-null Code_T values, just run the statement below:

df = df.where(col('Code_T').isNotNull())
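
Filtering works, but the null rows can also be avoided entirely: since Code already holds the same values joined by '*', splitting and exploding that one column yields the result directly. A sketch against the wide step 1 frame (again called wide_df here; split takes a regular expression, so the '*' must be escaped):

from pyspark.sql.functions import col, explode, split

# one output row per '*'-separated piece of Code; no null rows,
# because only pieces that actually exist are produced
df_alt = wide_df.select(
    "ID", "Code",
    explode(split(col("Code"), r"\*")).alias("Code_T"),
)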