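Here is the input data:
+-------------------+-------------------+----+
|              start|        format_date|diff|
+-------------------+-------------------+----+
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|
+-------------------+-------------------+----+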
Expected output:
start format_date diff seq
2019-11-15 20:30:00 2019-11-15 18:30:00 4 1
2019-11-15 20:30:00 2019-11-15 18:30:00 4 2
2019-11-15 20:30:00 2019-11-15 18:30:00 4 3
2019-11-15 20:30:00 2019-11-15 18:30:00 4 4
How can I generate rows based on the value of a column (diff)?
Answer 0 (score: 1)
Solution for Spark 2.4 or later:
from pyspark.sql import functions as F
# Assumes an active SparkSession named `spark`, as in the pyspark shell
df = spark.createDataFrame([["2019-11-15 20:30:00", "2019-11-15 18:30:00", 4]], ["start", "format_date", "diff"])
# sequence(1, diff) builds the array [1, ..., diff]; explode emits one row per element
df.select("*", F.explode(F.sequence(F.lit(1), F.col("diff"))).alias("seq")).show()
+-------------------+-------------------+----+---+
| start| format_date|diff|seq|
+-------------------+-------------------+----+---+
|2019-11-15 20:30:00|2019-11-15 18:30:00| 4| 1|
|2019-11-15 20:30:00|2019-11-15 18:30:00| 4| 2|
|2019-11-15 20:30:00|2019-11-15 18:30:00| 4| 3|
|2019-11-15 20:30:00|2019-11-15 18:30:00|   4|  4|
+-------------------+-------------------+----+---+
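To make the row expansion concrete, here is a minimal plain-Python sketch of what explode(sequence(1, diff)) does to each row. It runs without Spark and is only a model: spark_sequence and explode_by_diff are hypothetical helper names, and the step rule (count down when start is greater than stop) reflects my understanding of Spark's documented default for sequence, so check it against your Spark version.

```python
def spark_sequence(start, stop):
    """Plain-Python model of sequence(start, stop) with the default step.

    Assumption: the step is 1 when start <= stop and -1 otherwise.
    """
    step = 1 if start <= stop else -1
    return list(range(start, stop + step, step))


def explode_by_diff(rows):
    """Emit one output row per element of sequence(1, diff), as explode would."""
    out = []
    for row in rows:
        for seq in spark_sequence(1, row["diff"]):
            # Copy the original columns and append the generated counter
            out.append({**row, "seq": seq})
    return out


rows = [
    {"start": "2019-11-15 20:30:00", "format_date": "2019-11-15 18:30:00", "diff": 4}
]
expanded = explode_by_diff(rows)
print([r["seq"] for r in expanded])  # [1, 2, 3, 4]
print(spark_sequence(1, 0))          # [1, 0]
```

Under that assumption, a row with diff = 0 would not disappear but produce two rows (seq 1 and seq 0), so filtering on diff >= 1 before exploding may be worth considering if such values can occur.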
Answer 1 (score: 0)
For Spark < 2.4:
You can build the array with a UDF and then apply the explode function.
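import pyspark.sql.functions as F
import pyspark.sql.types as Types

def rangeArr(diff):
    # Return a list rather than a lazy range object so it maps onto ArrayType under Python 3
    return list(range(1, diff + 1))

# UDF that produces the array [1, ..., diff] for each row
rangeUdf = F.udf(rangeArr, Types.ArrayType(Types.IntegerType()))
df = df.withColumn('seqArr', rangeUdf('diff'))
# explode emits one row per array element; drop seqArr afterwards if it is not needed
df = df.withColumn('seq', F.explode('seqArr'))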
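If the diff column can contain nulls or non-positive values, the function behind the UDF may need a guard, since diff + 1 raises a TypeError when diff is None. The following is a minimal sketch of such a guard in plain Python, so it can be checked without a cluster. The name range_arr is hypothetical, and returning an empty list is just one possible policy; it would be wrapped with F.udf in the same way as rangeArr.

```python
def range_arr(diff):
    """Return [1, ..., diff], or an empty list for null or non-positive input."""
    if diff is None or diff < 1:
        return []
    return list(range(1, diff + 1))


print(range_arr(4))     # [1, 2, 3, 4]
print(range_arr(0))     # []
print(range_arr(None))  # []
```

Note that explode produces no output rows for an empty array, so rows with such diff values would be dropped from the result; explode_outer is the variant to look at if they should be kept.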