Simple fill-down with a Spark DataFrame (Scala)

Date: 2016-12-15 06:27:07

Tags: scala apache-spark dataframe

If you have a simple DataFrame like the following:

// .toDF on an RDD requires import sqlContext.implicits._
val n = sc.parallelize(List[String](
    "Alice", null, null,
    "Bob", null, null,
    "Chuck"
    )).toDF("name")

which looks like this:

//+-----+
//| name|
//+-----+
//|Alice|
//| null|
//| null|
//|  Bob|
//| null|
//| null|
//|Chuck|
//+-----+

how do you fill down (forward-fill) the names using DataFrame functions to get:

//+-----+
//| name|
//+-----+
//|Alice|
//|Alice|
//|Alice|
//|  Bob|
//|  Bob|
//|  Bob|
//|Chuck|
//+-----+

Note: please state any required imports. I suspect these include:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{WindowSpec, Window}

Note: some of the sites I have tried to model this on are:

http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html

https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html

I have run into things like this in the past, so I know behavior can differ between Spark versions. I am using 1.5.2 on the cluster (where this solution would be most useful) and 2.0 for local testing. A 1.5.2-compatible solution is preferred.

Also, I would like to avoid writing SQL directly, i.e. avoid sqlContext.sql(...).

1 Answer:

Answer 0 (score: 1)

Here is a suggestion, if you have another column that allows the values to be grouped:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._

val df = Seq(
  (Some("Alice"), 1),
  (None, 1), 
  (None, 1), 
  (Some("Bob"), 2), 
  (None, 2), 
  (None, 2), 
  (Some("Chuck"), 3)
).toDF("name", "group")

// min over the group-partitioned window skips nulls, so it returns the
// single non-null name in each group
val result = df.withColumn("new_col", min(col("name")).over(Window.partitionBy("group")))

result.show()

+-----+-----+-------+
| name|group|new_col|
+-----+-----+-------+
|Alice|    1|  Alice|
| null|    1|  Alice|
| null|    1|  Alice|
|  Bob|    2|    Bob|
| null|    2|    Bob|
| null|    2|    Bob|
|Chuck|    3|  Chuck|
+-----+-----+-------+
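
On Spark 2.0 this can also be written with first and ignoreNulls = true, which states the intent more directly than min (min works above only because each group contains exactly one non-null name). A sketch assuming the same df as above; this overload of first is not available in 1.5.2:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// first(..., ignoreNulls = true) returns the first non-null name in the
// group, regardless of how the names sort (Spark 2.0+ only)
val result2 = df.withColumn(
  "new_col",
  first(col("name"), ignoreNulls = true).over(Window.partitionBy("group")))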

On the other hand, if you only have a column that allows ordering but not grouping, the solution is a little harder. My first idea is to create a subset of the non-null rows and then do a join:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._

val df = Seq(
  (Some("Alice"), 1),
  (None, 2), 
  (None, 3), 
  (Some("Bob"), 4), 
  (None, 5), 
  (None, 6), 
  (Some("Chuck"), 7)
).toDF("name", "order")

// keep only the non-null rows and record where the next non-null row starts
// (an un-partitioned window pulls all rows into a single partition)
val subset = df
  .select("name", "order")
  .where(col("name").isNotNull)
  .withColumn("next", lead("order", 1).over(Window.orderBy("order")))

// join each row to the non-null row whose [order, next) range contains it;
// the isNull check covers rows at or after the last non-null name
val partial = df.as("a")
  .join(subset.as("b"),
    col("a.order") >= col("b.order") &&
      (col("a.order") < col("b.next") || col("b.next").isNull), "left")
val result = partial.select(coalesce(col("a.name"), col("b.name")).as("name"), col("a.order"))

// the join does not guarantee row order, so sort before showing
result.orderBy("order").show()

+-----+-----+
| name|order|
+-----+-----+
|Alice|    1|
|Alice|    2|
|Alice|    3|
|  Bob|    4|
|  Bob|    5|
|  Bob|    6|
|Chuck|    7|
+-----+-----+
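
For completeness: on Spark 2.0 the subset-and-join can be avoided entirely, since last with ignoreNulls = true over a running window is a common forward-fill idiom. A sketch assuming the same df with the order column as above; this overload of last does not exist in 1.5.2, so it does not meet the 1.5.2 requirement:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// window from the first row up to and including the current row;
// last(..., ignoreNulls = true) then picks the most recent non-null name.
// An un-partitioned, ordered window moves all rows to one partition.
val w = Window.orderBy("order").rowsBetween(Long.MinValue, 0)
val filled = df.withColumn("name", last(col("name"), ignoreNulls = true).over(w))

filled.orderBy("order").show()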