For example, going from:
+-----+-----+
|Date |val_1|
+-----+-----+
| 1-1 | 1.1 |
| 1-2 | 1.2 |
| 1-3 | 1.3 |
| 1-4 | 1.4 |
| 1-5 | 1.5 |
| 1-6 | 1.6 |
| 1-7 | 1.7 |
| 1-8 | 1.8 |
| 1-9 | 1.9 |
| ... | ... |
+-----+-----+
to:
+-----+-----+-----+-----+
|Date | D-3 | D-2 | D-1 |
+-----+-----+-----+-----+
| 1-4 | 1.1 | 1.2 | 1.3 |
| 1-5 | 1.2 | 1.3 | 1.4 |
| 1-6 | 1.3 | 1.4 | 1.5 |
| 1-7 | 1.4 | 1.5 | 1.6 |
| 1-8 | 1.5 | 1.6 | 1.7 |
| 1-9 | 1.6 | 1.7 | 1.8 |
| ... | ... | ... | ... |
+-----+-----+-----+-----+
Many thanks in advance.
Answer 0 (score: 2)
Your question isn't entirely clear, in particular regarding the iterative solution you are after. However, for the sample data provided:
df = sc.parallelize([('1-1', 1.1), ('1-2', 1.2), ('1-3', 1.3), ('1-4', 1.4), ('1-5', 1.5),
                     ('1-6', 1.6), ('1-7', 1.7), ('1-8', 1.8), ('1-9', 1.9)]).toDF(["Date", "val_1"])
you can use lag together with a Window to retrieve D-3, D-2 and D-1:
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

# One window over the whole DataFrame, ordered by Date
w = Window.partitionBy().orderBy(col("Date"))

# lag(col, n) looks n rows back; na.drop() removes the leading rows
# that do not yet have three days of history
dfl = df.select("Date",
                lag("val_1", 3).over(w).alias("D-3"),
                lag("val_1", 2).over(w).alias("D-2"),
                lag("val_1", 1).over(w).alias("D-1")).na.drop()
dfl.show()
This gives the following output:
+----+---+---+---+
|Date|D-3|D-2|D-1|
+----+---+---+---+
| 1-4|1.1|1.2|1.3|
| 1-5|1.2|1.3|1.4|
| 1-6|1.3|1.4|1.5|
| 1-7|1.4|1.5|1.6|
| 1-8|1.5|1.6|1.7|
| 1-9|1.6|1.7|1.8|
+----+---+---+---+
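Since the question hints at wanting an iterative approach, here is a minimal sketch along the same lines (reusing df, w and the imports above; n_days and lag_cols are names introduced here purely for illustration) that builds the lag columns in a loop, so the number of look-back days becomes a parameter:
n_days = 3
# one lag column per offset: D-3, D-2, D-1 for n_days = 3
lag_cols = [lag("val_1", i).over(w).alias("D-%d" % i) for i in range(n_days, 0, -1)]
dfl = df.select("Date", *lag_cols).na.drop()
dfl.show()
Note that a window with an empty partitionBy() moves all rows to a single partition; that is fine for this toy example but can become a bottleneck on larger data.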
Answer 1 (score: 1)
Thanks to Jaco for the inspiration. Here is the Scala version:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = sc.parallelize(Seq(("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5),
  ("1-6", 1.6), ("1-7", 1.7), ("1-8", 1.8), ("1-9", 1.9))).toDF("Date", "val_1")
val w = Window.partitionBy().orderBy("Date")
// Omitting lag's default leaves null where there is no prior row, so na.drop() removes the first three dates
val res = df.withColumn("D-3", lag("val_1", 3).over(w))
  .withColumn("D-2", lag("val_1", 2).over(w))
  .withColumn("D-1", lag("val_1", 1).over(w)).na.drop()
Result:
+----+-----+---+---+---+
|Date|val_1|D-3|D-2|D-1|
+----+-----+---+---+---+
| 1-4| 1.4|1.1|1.2|1.3|
| 1-5| 1.5|1.2|1.3|1.4|
| 1-6| 1.6|1.3|1.4|1.5|
| 1-7| 1.7|1.4|1.5|1.6|
| 1-8| 1.8|1.5|1.6|1.7|
| 1-9| 1.9|1.6|1.7|1.8|
+----+-----+---+---+---+
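Because withColumn keeps the original columns, val_1 still appears here; if only the lag columns are wanted, as in the question's target layout, it can be removed with res.drop("val_1").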