Create multiple DataFrames from an existing DataFrame in PySpark

Date: 2021-04-22 20:35:04

Tags: apache-spark pyspark apache-spark-sql

I have a DataFrame in PySpark, as shown below:

data = [{"B_ID": 'TEST', "Category": 'Category A', "ID": 1, "Value": 1},
        {"B_ID": 'TEST', "Category": 'Category B', "ID": 2, "Value": 2},
        {"B_ID": 'TEST', "Category": 'Category C', "ID": 3, "Value": None},
        {"B_ID": 'TEST', "Category": 'Category D', "ID": 4, "Value": 3},
        ]

df = spark.createDataFrame(data)
df.show()

+----+----------+---+-----+
|B_ID|  Category| ID|Value|
+----+----------+---+-----+
|TEST|Category A|  1|    1|
|TEST|Category B|  2|    2|
|TEST|Category C|  3| null|
|TEST|Category D|  4|    3|
+----+----------+---+-----+

Now, from the DataFrame above, I want to create several new DataFrames by changing the values in some of the columns.

Here is what I have done:

import pyspark.sql.functions as f
from functools import reduce

value_1 = 'TEST_1'

# changing B_ID column values and ID column values
df1 = df.withColumn("B_ID", f.lit(value_1)).withColumn("id", f.lit(5))
df1.show()
+------+----------+---+-----+
|  B_ID|  Category| id|Value|
+------+----------+---+-----+
|TEST_1|Category A|  5|    1|
|TEST_1|Category B|  5|    2|
|TEST_1|Category C|  5| null|
|TEST_1|Category D|  5|    3|
+------+----------+---+-----+


value_2 = 'TESTING'
# changing only the B_ID values; "id" keeps the original ID values
df2 = df.withColumn("B_ID", f.lit(value_2)).withColumn("id", f.col("id"))
df2.show()
+-------+----------+---+-----+
|   B_ID|  Category| id|Value|
+-------+----------+---+-----+
|TESTING|Category A|  1|    1|
|TESTING|Category B|  2|    2|
|TESTING|Category C|  3| null|
|TESTING|Category D|  4|    3|
+-------+----------+---+-----+

# keeping B_ID as-is; f.lit(6) (an int, not the string "6") keeps id numeric for the union
df3 = df.withColumn("B_ID", f.col("B_ID")).withColumn("id", f.lit(6))
df3.show()

+----+----------+---+-----+
|B_ID|  Category| id|Value|
+----+----------+---+-----+
|TEST|Category A|  6|    1|
|TEST|Category B|  6|    2|
|TEST|Category C|  6| null|
|TEST|Category D|  6|    3|
+----+----------+---+-----+

Now that the DataFrames are created, I want to union all of them together.

Here is what I did:

# list of data frames to union
list_df = [df1, df2, df3]

# union all the data frames (DataFrame lives in pyspark.sql, not pyspark.sql.functions)
from pyspark.sql import DataFrame
final_df = reduce(DataFrame.union, list_df)

final_df.show()
+-------+----------+---+-----+
|   B_ID|  Category| id|Value|
+-------+----------+---+-----+
| TEST_1|Category A|  5|    1|
| TEST_1|Category B|  5|    2|
| TEST_1|Category C|  5| null|
| TEST_1|Category D|  5|    3|
|TESTING|Category A|  1|    1|
|TESTING|Category B|  2|    2|
|TESTING|Category C|  3| null|
|TESTING|Category D|  4|    3|
|   TEST|Category A|  6|    1|
|   TEST|Category B|  6|    2|
|   TEST|Category C|  6| null|
|   TEST|Category D|  6|    3|
+-------+----------+---+-----+

This achieves what I want, but I would like to know whether there is a better way to get the same result.

1 Answer:

Answer 0 (score: 1)

Here is another approach, using the `inline` function to explode an array of structs:

df2 = df.selectExpr(
    'Category',
    'Value',
    "inline(array(('TEST_1' as B_ID, 5 as id), ('TESTING' as B_ID, id), (B_ID, 6 as id)))"
).select(df.columns)

df2.show()
+-------+----------+---+-----+
|   B_ID|  Category| ID|Value|
+-------+----------+---+-----+
| TEST_1|Category A|  5|    1|
|TESTING|Category A|  1|    1|
|   TEST|Category A|  6|    1|
| TEST_1|Category B|  5|    2|
|TESTING|Category B|  2|    2|
|   TEST|Category B|  6|    2|
| TEST_1|Category C|  5| null|
|TESTING|Category C|  3| null|
|   TEST|Category C|  6| null|
| TEST_1|Category D|  5|    3|
|TESTING|Category D|  4|    3|
|   TEST|Category D|  6|    3|
+-------+----------+---+-----+