My dataframe looks like the following:
new_df = spark.createDataFrame([
    ([{'product_code': '12', 'color': 'red'}, {'product_code': '212', 'color': 'white'}], 7),
    ([{'product_code': '1112', 'color': 'black'}], 8),
    ([{'product_code': '212', 'color': 'blue'}], 3)
], ["items", "frequency"])
I need to create a dataframe like the one below so that I can easily save it to CSV (items from the same list share the same rule number):
# +----+------------+-----+
# |rule|product_code|color|
# +----+------------+-----+
# |   1|          12|  red|
# |   1|         212|white|
# |   2|        1112|black|
# |   3|         212| blue|
# +----+------------+-----+
Answer (score: 4):
You can add a monotonically_increasing_id as an identifier and explode the items column:
from pyspark.sql.functions import explode, monotonically_increasing_id, col

(new_df
 .withColumn("rule", monotonically_increasing_id())
 .withColumn("items", explode("items"))
 .select(
     "rule",
     col("items")["product_code"].alias("product_code"),
     col("items")["color"].alias("color"))
 .show())
# +-----------+------------+-----+
# | rule|product_code|color|
# +-----------+------------+-----+
# | 8589934592| 12| red|
# | 8589934592| 212|white|
# |17179869184| 1112|black|
# |25769803776| 212| blue|
# +-----------+------------+-----+
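Since the stated goal is a CSV file, the same pipeline can end in a write instead of show(). A minimal sketch, where "rules_csv" is a placeholder output path:

(new_df
 .withColumn("rule", monotonically_increasing_id())
 .withColumn("items", explode("items"))
 .select(
     "rule",
     col("items")["product_code"].alias("product_code"),
     col("items")["color"].alias("color"))
 # "rules_csv" is a placeholder output directory
 .write.csv("rules_csv", header=True, mode="overwrite"))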
You can get consecutive IDs with zipWithIndex, but it requires an expensive round trip through a Python RDD.
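For reference, a minimal sketch of that alternative, assuming consecutive 1-based rule numbers are wanted (the variable name indexed is illustrative):

from pyspark.sql import Row
from pyspark.sql.functions import explode, col

# Pair each row with a consecutive 0-based index, shift it to start at 1,
# and rebuild a DataFrame before exploding as above.
indexed = (new_df.rdd
           .zipWithIndex()
           .map(lambda ri: Row(items=ri[0]["items"], rule=ri[1] + 1))
           .toDF())

(indexed
 .withColumn("items", explode("items"))
 .select(
     "rule",
     col("items")["product_code"].alias("product_code"),
     col("items")["color"].alias("color"))
 .show())

This produces the exact 1, 2, 3 numbering shown in the question, at the cost of serializing every row to Python and back.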