Suppose you have a simple DataFrame like this:
val n = sc.parallelize(List[String](
"Alice", null, null,
"Bob", null, null,
"Chuck"
)).toDF("name")
which looks like this:
//+-----+
//| name|
//+-----+
//|Alice|
//| null|
//| null|
//|  Bob|
//| null|
//| null|
//|Chuck|
//+-----+
How can I fill the values down (forward fill) using DataFrame functions to get:
//+-----+
//| name|
//+-----+
//|Alice|
//|Alice|
//|Alice|
//|  Bob|
//|  Bob|
//|  Bob|
//|Chuck|
//+-----+
Note: please state any required imports. I suspect they include:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{WindowSpec, Window}
Note: some of the pages I have been trying to imitate are:
http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html
and
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
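I have run into this kind of thing before and found that the behaviour differs between Spark versions. I am using 1.5.2 on the cluster (where a solution would be most useful) and 2.0 for local simulation. I would prefer a solution that is compatible with 1.5.2.
Also, I would like to avoid writing SQL directly, i.e. avoid sqlContext.sql(...).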
Answer 0 (score: 1)
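If you have another column that lets you group the values, here is a suggestion. Aggregates such as min skip nulls, so within each group the single non-null name is copied to every row: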
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
(Some("Alice"), 1),
(None, 1),
(None, 1),
(Some("Bob"), 2),
(None, 2),
(None, 2),
(Some("Chuck"), 3)
).toDF("name", "group")
val result = df.withColumn("new_col", min(col("name")).over(Window.partitionBy("group")))
result.show()
+-----+-----+-------+
| name|group|new_col|
+-----+-----+-------+
|Alice|    1|  Alice|
| null|    1|  Alice|
| null|    1|  Alice|
|  Bob|    2|    Bob|
| null|    2|    Bob|
| null|    2|    Bob|
|Chuck|    3|  Chuck|
+-----+-----+-------+
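If, on the other hand, you only have a column that allows ordering but not grouping, the solution is a little harder. My first idea was to build a subset of the non-null rows and then join it back: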
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
(Some("Alice"), 1),
(None, 2),
(None, 3),
(Some("Bob"), 4),
(None, 5),
(None, 6),
(Some("Chuck"), 7)
).toDF("name", "order")
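// Keep only the non-null rows and attach the "order" value of the next non-null row.
val subset = df
  .select("name", "order")
  .where(col("name").isNotNull)
  .withColumn("next", lead("order", 1).over(Window.orderBy("order")))
// Match every row to the non-null row whose range [order, next) contains it.
// "next" is null for the last non-null row, so that range is left open-ended.
val partial = df.as("a")
  .join(subset.as("b"),
    col("a.order") >= col("b.order") && (col("b.next").isNull || col("a.order") < col("b.next")),
    "left")
val result = partial.select(coalesce(col("a.name"), col("b.name")).as("name"), col("a.order"))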
result.show()
+-----+-----+
| name|order|
+-----+-----+
|Alice|    1|
|Alice|    2|
|Alice|    3|
|  Bob|    4|
|  Bob|    5|
|  Bob|    6|
|Chuck|    7|
+-----+-----+
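
A possible alternative that avoids the join is sketched below. It derives a grouping column from the ordering column and then reuses the partitioned min from the first example. It assumes a `sqlContext` is already in scope, as in the snippets above. The names `running`, `grp`, `withGroup` and `filled` are made up for illustration. I have not verified this on 1.5.2, so treat it as a starting point rather than a tested solution.

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
// Assumption: `sqlContext` is in scope, as in the snippets above
// (in a Spark 2.0 shell it can be obtained with `val sqlContext = spark.sqlContext`).
import sqlContext.implicits._

val df = Seq(
  (Some("Alice"), 1), (None, 2), (None, 3),
  (Some("Bob"), 4), (None, 5), (None, 6),
  (Some("Chuck"), 7)
).toDF("name", "order")

// Frame from the first row of the ordering up to the current row.
// Long.MinValue stands for "unbounded preceding"; 0 is the current row.
val running = Window.orderBy("order").rowsBetween(Long.MinValue, 0)

// count(name) skips nulls, so the running count only increases on non-null rows.
// Every null row therefore shares a "grp" value with the last non-null row before it.
val withGroup = df.withColumn("grp", count(col("name")).over(running))

// Each group holds exactly one non-null name; min skips nulls and picks it.
val filled = withGroup
  .withColumn("filled", min(col("name")).over(Window.partitionBy("grp")))
  .select(col("filled").as("name"), col("order"))
  .orderBy("order")

filled.show()
```

For the sample data this should reproduce the names in the table above. Rows that come before the first non-null value land in a group with no name and stay null. Note that a window with orderBy but no partitionBy moves all rows into a single partition, so this may not scale to large data. On Spark 2.0 and later, `last(col("name"), ignoreNulls = true).over(running)` is a shorter way to get the same fill, but that overload is not available in 1.5.2.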