Spark DataFrames: merging two consecutive rows

Asked: 2018-12-22 23:38:37

Tags: apache-spark dataframe apache-spark-sql

I have a DataFrame with the following structure:

|  id  |  time  |  x  |  y  |
-----------------------------
|  1   |   1    |  0  |  3  |
|  1   |   2    |  3  |  2  |
|  1   |   5    |  6  |  1  |
|  2   |   1    |  3  |  7  |
|  2   |   2    |  1  |  9  |
|  3   |   1    |  7  |  5  |
|  3   |   2    |  9  |  3  |
|  3   |   7    |  2  |  5  |
|  3   |   8    |  4  |  7  |
|  4   |   1    |  7  |  9  |
|  4   |   2    |  9  |  0  |

What I want to achieve is to add, for each record, three extra columns containing the time, x and y of the next record (ordered by time). The catch is that we only take the next record when it has the same id; otherwise the three new columns should be set to null.

Here is the output I want to get:

|  id  |  time  |  x  |  y  | time+1 | x+1 | y+1 |
--------------------------------------------------
|  1   |   1    |  0  |  3  |   2    |  3  |  2  |
|  1   |   2    |  3  |  2  |   5    |  6  |  1  |
|  1   |   5    |  6  |  1  |  null  | null| null|
|  2   |   1    |  3  |  7  |   2    |  1  |  9  |
|  2   |   2    |  1  |  9  |  null  | null| null|
|  3   |   1    |  7  |  5  |   2    |  9  |  3  |
|  3   |   2    |  9  |  3  |   7    |  2  |  5  |
|  3   |   7    |  2  |  5  |   8    |  4  |  7  |
|  3   |   8    |  4  |  7  |  null  | null| null|
|  4   |   1    |  7  |  9  |   2    |  9  |  0  |
|  4   |   2    |  9  |  0  |  null  | null| null|

Is it possible to achieve this with Spark DataFrames?

3 Answers:

Answer 0 (score: 1)

You can use the lead window function. First create a window specification partitioned by the id column (and ordered by time), then, in withColumn, apply lead with an offset of 1 to the column you want to shift.

Something like this:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead  // lead lives in the functions package
// in a standalone app, also import spark.implicits._ for the 'col Symbol syntax

val windowSpec = Window.partitionBy('id).orderBy('time)
dataset.withColumn("time1", lead('time, 1) over windowSpec).show

You can add the other columns in the same way, as in the sketch below.
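
For completeness, a minimal sketch of the three-column version, assuming a running SparkSession named spark (as in spark-shell) and building the sample data from the question inline; the output column names time+1, x+1 and y+1 mirror the question:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead
import spark.implicits._   // 'col Symbol syntax and toDF

// sample data from the question
val df = Seq((1,1,0,3),(1,2,3,2),(1,5,6,1),(2,1,3,7),(2,2,1,9),
             (3,1,7,5),(3,2,9,3),(3,7,2,5),(3,8,4,7),(4,1,7,9),(4,2,9,0))
  .toDF("id","time","x","y")

// one window per id, ordered by time, so lead never crosses id boundaries
val windowSpec = Window.partitionBy('id).orderBy('time)

df.withColumn("time+1", lead('time, 1) over windowSpec)
  .withColumn("x+1",    lead('x, 1)    over windowSpec)
  .withColumn("y+1",    lead('y, 1)    over windowSpec)
  .show()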

Answer 1 (score: 1)

If you are familiar with SQL, just create a temporary view and add all the columns in one shot. Check this out:

scala> val df = Seq((1,1,0,3),(1,2,3,2),(1,5,6,1),(2,1,3,7),(2,2,1,9),(3,1,7,5),(3,2,9,3),(3,7,2,5),(3,8,4,7),(4,1,7,9),(4,2,9,0)).toDF("id","time","x","y")
df: org.apache.spark.sql.DataFrame = [id: int, time: int ... 2 more fields]

scala> df.createOrReplaceTempView("m2008")

scala> spark.sql(""" select *, lead(time) over(partition by id order by time) timep1,lead(x) over(partition by id order by time) xp1, lead(y) over(partition by id order by time) yp1 from m2008 """).show(false)
+---+----+---+---+------+----+----+
|id |time|x  |y  |timep1|xp1 |yp1 |
+---+----+---+---+------+----+----+
|1  |1   |0  |3  |2     |3   |2   |
|1  |2   |3  |2  |5     |6   |1   |
|1  |5   |6  |1  |null  |null|null|
|3  |1   |7  |5  |2     |9   |3   |
|3  |2   |9  |3  |7     |2   |5   |
|3  |7   |2  |5  |8     |4   |7   |
|3  |8   |4  |7  |null  |null|null|
|4  |1   |7  |9  |2     |9   |0   |
|4  |2   |9  |0  |null  |null|null|
|2  |1   |3  |7  |2     |1   |9   |
|2  |2   |1  |9  |null  |null|null|
+---+----+---+---+------+----+----+


Just assign the result of spark.sql and you get it back as another DataFrame:

scala> val df2 = spark.sql(""" select *, lead(time) over(partition by id order by time) timep1,lead(x) over(partition by id order by time) xp1, lead(y) over(partition by id order by time) yp1 from m2008 """)
df2: org.apache.spark.sql.DataFrame = [id: int, time: int ... 5 more fields]

scala> df2.printSchema
root
 |-- id: integer (nullable = false)
 |-- time: integer (nullable = false)
 |-- x: integer (nullable = false)
 |-- y: integer (nullable = false)
 |-- timep1: integer (nullable = true)
 |-- xp1: integer (nullable = true)
 |-- yp1: integer (nullable = true)

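If you would rather skip the temp view, the same window expressions can also be pushed through selectExpr; a quick sketch against the df defined above:

val df2 = df.selectExpr(
  "*",
  "lead(time) over (partition by id order by time) as timep1",
  "lead(x) over (partition by id order by time) as xp1",
  "lead(y) over (partition by id order by time) as yp1")
df2.show(false)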

Answer 2 (score: 0)

In Scala, you can also do it this way:


scala> import org.apache.spark.sql.expressions.Window


scala> val part = Window.partitionBy('id).orderBy('time)


scala> spark.read.format("csv").option("inferSchema","true").option("header",true).load("file:///home/ec2-user/test.csv").withColumn("time1", lead('time,1) over part).withColumn("x+1", lead('x,1) over part).withColumn("y+1", lead('y,1) over part).show()

You can also check the snapshot of my run below:

(screenshot: running snapshot of the program using the **window lead function**)