I have a DataFrame with the following structure:
| id | time | x | y |
-----------------------------
| 1 | 1 | 0 | 3 |
| 1 | 2 | 3 | 2 |
| 1 | 5 | 6 | 1 |
| 2 | 1 | 3 | 7 |
| 2 | 2 | 1 | 9 |
| 3 | 1 | 7 | 5 |
| 3 | 2 | 9 | 3 |
| 3 | 7 | 2 | 5 |
| 3 | 8 | 4 | 7 |
| 4 | 1 | 7 | 9 |
| 4 | 2 | 9 | 0 |
What I want to achieve is to create three additional columns for each record, containing the time, x and y of the next record (ordered by time). The catch is that we should only take the next record when it has the same id value; otherwise the three new columns should be set to null.
This is the output I want to get:
| id | time | x | y | time+1 | x+1 | y+1 |
--------------------------------------------------
| 1 | 1 | 0 | 3 | 2 | 3 | 2 |
| 1 | 2 | 3 | 2 | 5 | 6 | 1 |
| 1 | 5 | 6 | 1 | null | null| null|
| 2 | 1 | 3 | 7 | 2 | 1 | 9 |
| 2 | 2 | 1 | 9 | null | null| null|
| 3 | 1 | 7 | 5 | 2 | 9 | 3 |
| 3 | 2 | 9 | 3 | 7 | 2 | 5 |
| 3 | 7 | 2 | 5 | 8 | 4 | 7 |
| 3 | 8 | 4 | 7 | null | null| null|
| 4 | 1 | 7 | 9 | 2 | 9 | 0 |
| 4 | 2 | 9 | 0 | null | null| null|
Is it possible to achieve this with Spark DataFrames?
Answer 0 (score: 1)
You can use the lead window function. First create a window by partitioning on the id column and ordering by time, then pass the column you want, with an offset of 1, to lead when calling withColumn.
Something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

// Window per id, ordered by time, so lead looks at the next row of the same id.
val windowSpec = Window.partitionBy('id).orderBy('time)
dataset.withColumn("time1", lead('time, 1) over windowSpec).show
You can add the other columns in the same way, as shown in the sketch below.
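Putting it together, a minimal sketch assuming the same dataset as above (the timep1, xp1 and yp1 column names are just illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

val windowSpec = Window.partitionBy('id).orderBy('time)

// lead returns null when the partition has no next row, which produces
// exactly the nulls required for the last record of each id.
val result = dataset
  .withColumn("timep1", lead('time, 1) over windowSpec)
  .withColumn("xp1", lead('x, 1) over windowSpec)
  .withColumn("yp1", lead('y, 1) over windowSpec)

result.show()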
Answer 1 (score: 1)
If you are comfortable with SQL, just create a temporary view and build all the columns in one shot. Check this out:
scala> val df = Seq((1,1,0,3),(1,2,3,2),(1,5,6,1),(2,1,3,7),(2,2,1,9),(3,1,7,5),(3,2,9,3),(3,7,2,5),(3,8,4,7),(4,1,7,9),(4,2,9,0)).toDF("id","time","x","y")
df: org.apache.spark.sql.DataFrame = [id: int, time: int ... 2 more fields]
scala> df.createOrReplaceTempView("m2008")
scala> spark.sql(""" select *, lead(time) over(partition by id order by time) timep1,lead(x) over(partition by id order by time) xp1, lead(y) over(partition by id order by time) yp1 from m2008 """).show(false)
+---+----+---+---+------+----+----+
|id |time|x |y |timep1|xp1 |yp1 |
+---+----+---+---+------+----+----+
|1 |1 |0 |3 |2 |3 |2 |
|1 |2 |3 |2 |5 |6 |1 |
|1 |5 |6 |1 |null |null|null|
|3 |1 |7 |5 |2 |9 |3 |
|3 |2 |9 |3 |7 |2 |5 |
|3 |7 |2 |5 |8 |4 |7 |
|3 |8 |4 |7 |null |null|null|
|4 |1 |7 |9 |2 |9 |0 |
|4 |2 |9 |0 |null |null|null|
|2 |1 |3 |7 |2 |1 |9 |
|2 |2 |1 |9 |null |null|null|
+---+----+---+---+------+----+----+
Just assign the spark.sql result and you get it back as another DataFrame:
scala> val df2 = spark.sql(""" select *, lead(time) over(partition by id order by time) timep1,lead(x) over(partition by id order by time) xp1, lead(y) over(partition by id order by time) yp1 from m2008 """)
df2: org.apache.spark.sql.DataFrame = [id: int, time: int ... 5 more fields]
scala> df2.printSchema
root
|-- id: integer (nullable = false)
|-- time: integer (nullable = false)
|-- x: integer (nullable = false)
|-- y: integer (nullable = false)
|-- timep1: integer (nullable = true)
|-- xp1: integer (nullable = true)
|-- yp1: integer (nullable = true)
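Note that the three lead columns come back nullable while the originals are not: lead returns null for the last row of each partition, since there is no next record with the same id.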
Answer 2 (score: 0)
In Scala you can also do it like this:
scala> import org.apache.spark.sql.expressions.Window
scala> val part = Window.partitionBy('id).orderBy('time)
scala> spark.read.format("csv").option("inferSchema", "true").option("header", true).load("file:///home/ec2-user/test.csv").withColumn("time1", lead('time, 1) over part).withColumn("x+1", lead('x, 1) over part).withColumn("y+1", lead('y, 1) over part).show()
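If you need lead columns for many fields, folding over the column names avoids repeating withColumn by hand. A minimal sketch, assuming the input DataFrame is available as df (the df name and the p1 suffix are illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

val part = Window.partitionBy('id).orderBy('time)

// Fold over the field names, adding one lead column per field.
val withLeads = Seq("time", "x", "y").foldLeft(df) { (acc, c) =>
  acc.withColumn(s"${c}p1", lead(col(c), 1) over part)
}
withLeads.show()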