How to iteratively transform a sub-matrix of a Spark RDD?

Asked: 2017-03-14 06:21:43

Tags: pyspark

For example, from:

+-----+-----+  
|Date |val_1|  
+-----+-----+  
| 1-1 |  1.1|   
| 1-2 |  1.2|    
| 1-3 |  1.3|  
| 1-4 |  1.4|  
| 1-5 |  1.5|  
| 1-6 |  1.6|  
| 1-7 |  1.7|  
| 1-8 |  1.8|  
| 1-9 |  1.9|  
|  ...|  ...| 

to:

+-----+-----+-----+-----+
|Date | D-3 | D-2 | D-1 |
+-----+-----+-----+-----+
| 1-4 | 1.1 | 1.2 | 1.3 |
| 1-5 | 1.2 | 1.3 | 1.4 |
| 1-6 | 1.3 | 1.4 | 1.5 |
| 1-7 | 1.4 | 1.5 | 1.6 |
| 1-8 | 1.5 | 1.6 | 1.7 |
| 1-9 | 1.6 | 1.7 | 1.8 |
| ... | ... | ... | ... |

Many thanks in advance.

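2 answers:

Answer 0 (score: 2)

Your question is not entirely clear, in particular regarding the iterative solution you are after. However, for the sample data provided: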

df = sc.parallelize([('1-1', 1.1), ('1-2', 1.2), ('1-3', 1.3), ('1-4', 1.4), ('1-5', 1.5), ('1-6', 1.6),('1-7', 1.7),('1-8', 1.8),('1-9', 1.9)]).toDF(["Date", "val_1"])

you can use lag in combination with Window to retrieve D-3, D-2 and D-1:

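from pyspark.sql.functions import lag
from pyspark.sql.window import Window

# A window without partitionBy pulls all rows into a single partition.
# Date is a string here, so the ordering is lexicographic ("1-10" sorts before "1-2").
w = Window.orderBy("Date")
# The offset is passed positionally because the keyword name differs between
# PySpark releases (count in older ones, offset in newer ones).
dfl = df.select("Date", lag("val_1", 3).over(w).alias("D-3"),
                        lag("val_1", 2).over(w).alias("D-2"),
                        lag("val_1", 1).over(w).alias("D-1")).na.drop()
dfl.show()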

This produces the following output:

+----+---+---+---+
|Date|D-3|D-2|D-1|
+----+---+---+---+
| 1-4|1.1|1.2|1.3|
| 1-5|1.2|1.3|1.4|
| 1-6|1.3|1.4|1.5|
| 1-7|1.4|1.5|1.6|
| 1-8|1.5|1.6|1.7|
| 1-9|1.6|1.7|1.8|
+----+---+---+---+
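
Since the question asks for an iterative approach, the three hard-coded lag columns could also be generated in a loop. The following is only a minimal sketch, not part of the original answer: the helper names lag_labels and with_lags and the n_lags parameter are made up for illustration, and it assumes a DataFrame shaped like df above.

```python
def lag_labels(n_lags, prefix="D-"):
    """Column names from the oldest lag to the most recent, e.g. D-3, D-2, D-1."""
    return [f"{prefix}{k}" for k in range(n_lags, 0, -1)]


def with_lags(df, value_col, order_col, n_lags):
    """Return order_col plus n_lags lagged copies of value_col (hypothetical helper)."""
    # Imported here so that lag_labels stays usable without a Spark installation.
    from pyspark.sql.functions import lag
    from pyspark.sql.window import Window

    # No partitionBy: Spark moves every row into a single partition for this window.
    w = Window.orderBy(order_col)
    offsets = range(n_lags, 0, -1)
    lagged = [lag(value_col, k).over(w).alias(name)
              for k, name in zip(offsets, lag_labels(n_lags))]
    # The first n_lags rows have no complete history, so at least one of their
    # lag columns is null and na.drop() removes them.
    return df.select(order_col, *lagged).na.drop()
```

With the sample data, with_lags(df, "val_1", "Date", 3).show() should give the same table as above, and changing n_lags widens or narrows the window without touching the rest of the code.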

Answer 1 (score: 1)

Thanks to Jaco for the inspiration. Here is the Scala version:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
// Outside spark-shell, toDF also needs: import spark.implicits._
val df = sc.parallelize(Seq(("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5), ("1-6", 1.6),("1-7", 1.7),("1-8", 1.8),("1-9", 1.9))).toDF("Date", "val_1")
val w = Window.orderBy("Date")
// No default value is passed to lag, so the first three rows contain nulls and na.drop() removes them.
val res = df.withColumn("D-3", lag("val_1", 3).over(w)).withColumn("D-2", lag("val_1", 2).over(w)).withColumn("D-1", lag("val_1", 1).over(w)).na.drop()
res.show()

Result:

+----+-----+---+---+---+
|Date|val_1|D-3|D-2|D-1|
+----+-----+---+---+---+
| 1-4|  1.4|1.1|1.2|1.3|
| 1-5|  1.5|1.2|1.3|1.4|
| 1-6|  1.6|1.3|1.4|1.5|
| 1-7|  1.7|1.4|1.5|1.6|
| 1-8|  1.8|1.5|1.6|1.7|
| 1-9|  1.9|1.6|1.7|1.8|
+----+-----+---+---+---+
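
Both answers rely on the same idea: for each row, look back a fixed number of rows in Date order and discard the rows that lack a full history. As a rough illustration only, that logic can be sketched in plain Python without Spark. The function name lagged_rows is hypothetical, and the input is assumed to be sorted by date already.

```python
def lagged_rows(rows, n_lags):
    """Pair each date with the n_lags values that precede it, oldest first."""
    out = []
    # Start at index n_lags: earlier rows do not have enough predecessors,
    # which mirrors the rows that na.drop() removes in the Spark versions.
    for i in range(n_lags, len(rows)):
        date = rows[i][0]
        previous = [value for _, value in rows[i - n_lags:i]]
        out.append((date, *previous))
    return out


rows = [("1-1", 1.1), ("1-2", 1.2), ("1-3", 1.3), ("1-4", 1.4), ("1-5", 1.5)]
for row in lagged_rows(rows, 3):
    print(row)
# prints:
# ('1-4', 1.1, 1.2, 1.3)
# ('1-5', 1.2, 1.3, 1.4)
```

This runs on a single machine and is meant only to show what the lag columns contain. For data that lives in Spark, the window-function versions above are the ones to use.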