假设我具有以下DataFrame:
aEnd
这是带有注释的相同样本DF,按+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 3|null| 11|
| 2| null| 2| xxx| 22|
| 1| null| 1| yyy|null|
| 2| null| 7|null| 33|
| 1| null| 12|null|null|
| 2| null| 19|null| 77|
| 1| null| 10| s13|null|
| 2| null| 11| a23|null|
+---+--------+---+----+----+
和grp
排序:
ord
我想压缩它。即要将其按列scala> df.orderBy("grp", "ord").show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 1| yyy|null|
| 1| null| 3|null| 11| # grp:1 - last value for `col2` (11)
| 1| null| 10| s13|null| # grp:1 - last value for `col1` (s13)
| 1| null| 12|null|null| # grp:1 - last values for `null_col`, `ord`
| 2| null| 2| xxx| 22|
| 2| null| 7|null| 33|
| 2| null| 11| a23|null| # grp:2 - last value for `col1` (a23)
| 2| null| 19|null| 77| # grp:2 - last values for `null_col`, `ord`, `col2`
+---+--------+---+----+----+
进行分组,并针对每个组,按"grp"
列对行进行排序,并获取每列中的最后一个"ord"
值(如果有的话)。
not null
我已经看到以下类似的问题:
但是我真正的DataFrame有超过250列,所以我需要一个无需显式指定所有列的解决方案。
我不能把头缠住...
MCVE:如何创建示例数据框:
readSparkOutput()
:将“ /tmp/data.txt”解析为DataFrame:
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 12| s13| 11|
| 2| null| 19| a23| 77|
+---+--------+---+----+----+
更新:我认为它应该类似于以下SQL:
val df = readSparkOutput("file:///tmp/data.txt")
我们如何动态生成这样的Spark查询?
我尝试了以下简化方法:
SELECT
grp, ord, null_col, col1, col2
FROM (
SELECT
grp,
ord,
FIRST(null_col) OVER (PARTITION BY grp ORDER BY ord DESC) as null_col,
FIRST(col1) OVER (PARTITION BY grp ORDER BY ord DESC) as col1,
FIRST(col2) OVER (PARTITION BY grp ORDER BY ord DESC) as col2,
ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ord DESC) as rn
FROM table_name) as v
WHERE v.rn = 1;
产生:
import org.apache.spark.sql.expressions.Window
val win = Window
.partitionBy("grp")
.orderBy($"ord".desc)
val cols = df.columns.map(c => first(c, ignoreNulls=true).over(win).as(c))
但我无法将其传递给scala> cols
res23: Array[org.apache.spark.sql.Column] = Array(first(grp, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `grp`, first(null_col, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `null_col`, first(ord, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `ord`, first(col1, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col1`, first(col2, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col2`)
:
df.select
另一种尝试:
scala> df.select(cols.head, cols.tail: _*).show
<console>:34: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
df.select(cols.head, cols.tail: _*).show
答案 0 :(得分:3)
请考虑以下方法,将按“ ord” /“ grp”排序的窗口函数last(c, ignoreNulls=true)
应用于每个选定列;随后是groupBy("grp")
,以获取first
see screenshot here结果:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df0 = Seq(
(1, 3, None, Some(11)),
(2, 2, Some("aaa"), Some(22)),
(1, 1, Some("s12"), None),
(2, 7, None, Some(33)),
(1, 12, None, None),
(2, 19, None, Some(77)),
(1, 10, Some("s13"), None),
(2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")
val df = df0.withColumn("null_col", lit(null))
df.orderBy("grp", "ord").show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// | 1| 1| s12|null| null|
// | 1| 3|null| 11| null|
// | 1| 10| s13|null| null|
// | 1| 12|null|null| null|
// | 2| 2| aaa| 22| null|
// | 2| 7|null| 33| null|
// | 2| 11| a23|null| null|
// | 2| 19|null| 77| null|
// +---+---+----+----+--------+
val win = Window.partitionBy("grp").orderBy("ord").
rowsBetween(0, Window.unboundedFollowing)
val nonAggCols = Array("grp")
val cols = df.columns.diff(nonAggCols) // Columns to be aggregated
val colFcnMap = cols.zip(Array.fill(cols.size)("first")).toMap
// colFcnMap: scala.collection.immutable.Map[String,String] =
// Map(ord -> first, col1 -> first, col2 -> first, null_col -> first)
cols.foldLeft(df)((acc, c) =>
acc.withColumn(c, last(c, ignoreNulls=true).over(win))
).
groupBy("grp").agg(colFcnMap).
select(col("grp") :: colFcnMap.toList.map{case (c, f) => col(s"$f($c)").as(c)}: _*).
show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// | 1| 12| s13| 11| null|
// | 2| 19| a23| 77| null|
// +---+---+----+----+--------+
请注意,最后的select
是用于从聚合列名中删除函数名(在这种情况下为first()
)。
答案 1 :(得分:3)
我已经解决了一些问题,这是代码和输出
import org.apache.spark.sql.functions._
import spark.implicits._
val df0 = Seq(
(1, 3, None, Some(11)),
(2, 2, Some("aaa"), Some(22)),
(1, 1, Some("s12"), None),
(2, 7, None, Some(33)),
(1, 12, None, None),
(2, 19, None, Some(77)),
(1, 10, Some("s13"), None),
(2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")
df0.show()
//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//| 1| 3|null| 11|
//| 2| 2| aaa| 22|
//| 1| 1| s12|null|
//| 2| 7|null| 33|
//| 1| 12|null|null|
//| 2| 19|null| 77|
//| 1| 10| s13|null|
//| 2| 11| a23|null|
//+---+---+----+----+
在前两列上排序数据
val df1 = df0.select("grp", "ord", "col1", "col2").orderBy("grp", "ord")
df1.show()
//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//| 1| 1| s12|null|
//| 1| 3|null| 11|
//| 1| 10| s13|null|
//| 1| 12|null|null|
//| 2| 2| aaa| 22|
//| 2| 7|null| 33|
//| 2| 11| a23|null|
//| 2| 19|null| 77|
//+---+---+----+----+
val df2 = df1.groupBy("grp").agg(max("ord").alias("ord"),collect_set("col1").alias("col1"),collect_set("col2").alias("col2"))
val df3 = df2.withColumn("new_col1",$"col1".apply(size($"col1").minus(1))).withColumn("new_col2",$"col2".apply(size($"col2").minus(1)))
df3.show()
//+---+---+----------+------------+--------+--------+
//|grp|ord| col1| col2|new_col1|new_col2|
//+---+---+----------+------------+--------+--------+
//| 1| 12|[s12, s13]| [11]| s13| 11|
//| 2| 19|[aaa, a23]|[33, 22, 77]| a23| 77|
//+---+---+----------+------------+--------+--------+
您可以使用 .drop(“ column_name”)
删除不需要的列答案 2 :(得分:2)
因此,这里我们按a分组,然后选择组中所有其他列的最大值:
scala> val df = List((1,2,11), (1,1,1), (2,1,4), (2,3,5)).toDF("a", "b", "c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> val aggCols = df.schema.map(_.name).filter(_ != "a").map(colName => sum(col(colName)).alias(s"max_$colName"))
aggCols: Seq[org.apache.spark.sql.Column] = List(sum(b) AS `max_b`, sum(c) AS `max_c`)
scala> df.groupBy(col("a")).agg(aggCols.head, aggCols.tail: _*)
res0: org.apache.spark.sql.DataFrame = [a: int, max_b: bigint ... 1 more field]
答案 3 :(得分:1)
这是您的答案(希望我的赏金!!!)
scala> val df = spark.sparkContext.parallelize(List(
| (1,null.asInstanceOf[String],3,null.asInstanceOf[String],new Integer(11)),
| (2,null.asInstanceOf[String],2,new String("xxx"),new Integer(22)),
| (1,null.asInstanceOf[String],1,new String("yyy"),null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],7,null.asInstanceOf[String],new Integer(33)),
| (1,null.asInstanceOf[String],12,null.asInstanceOf[String],null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],19,null.asInstanceOf[String],new Integer(77)),
| (1,null.asInstanceOf[String],10,new String("s13"),null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],11,new String("a23"),null.asInstanceOf[Integer]))).toDF("grp","null_col","ord","col1","col2")
df: org.apache.spark.sql.DataFrame = [grp: int, null_col: string ... 3 more fields]
scala> df.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 3|null| 11|
| 2| null| 2| xxx| 22|
| 1| null| 1| yyy|null|
| 2| null| 7|null| 33|
| 1| null| 12|null|null|
| 2| null| 19|null| 77|
| 1| null| 10| s13|null|
| 2| null| 11| a23|null|
+---+--------+---+----+----+
//创建窗口规范
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val win = Window.partitionBy("grp").orderBy($"ord".desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@71878833
///在所有列上都使用foldLeft和first over window规范,并采用不同的
scala> val result = df.columns.foldLeft(df)((df, colName) => df.withColumn(colName, first(colName, ignoreNulls=true).over(win).as(colName))).distinct
result: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [grp: int, null_col: string ... 3 more fields]
scala> result.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 12| s13| 11|
| 2| null| 19| a23| 77|
+---+--------+---+----+----+
希望这会有所帮助。
答案 4 :(得分:1)
我会使用与@LeoC相同的方法,但是我认为不需要将列名作为字符串进行操作,而我会使用更像spark-sql的答案。
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, last}
val win = Window.partitionBy("grp").orderBy(col("ord")).rowsBetween(0, Window.unboundedFollowing)
// In case there is more than one group column
val nonAggCols = Seq("grp")
// Select columns to aggregate on
val cols: Seq[String] = df.columns.diff(nonAggCols).toSeq
// Map over selection and apply fct
val aggregations: Seq[Column] = cols.map(c => first(col(c), ignoreNulls = true).as(c))
// I'd rather cache the following step as it might get expensive
val step1 = cols.foldLeft(df)((acc, c) => acc.withColumn(c, last(col(c), ignoreNulls = true).over(win))).cache
// Finally we can aggregate our results as followed
val results = step1.groupBy(nonAggCols.head, nonAggCols.tail: _*).agg(aggregations.head, aggregations.tail: _*)
results.show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// | 1| null| 12| s13| 11|
// | 2| null| 19| a23| 77|
// +---+--------+---+----+----+
我希望这会有所帮助。
编辑:之所以无法获得相同的结果,是因为您使用的阅读器不正确。
它将文件中的null
解释为字符串而不是null
;即:
scala> df.filter('col1.isNotNull).show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// | 1| null| 3|null| 11|
// | 2| null| 2| xxx| 22|
// | 1| null| 1| yyy|null|
// | 2| null| 7|null| 33|
// | 1| null| 12|null|null|
// | 2| null| 19|null| 77|
// | 1| null| 10| s13|null|
// | 2| null| 11| a23|null|
// +---+--------+---+----+----+
这是我的readSparkOutput
版本:
def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
val step1 = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "|")
.option("parserLib", "UNIVOCITY")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.option("comment", "+")
.csv(filePath)
val step2 = step1.select(step1.columns.filterNot(_.startsWith("_c")).map(step1(_)): _*)
val columns = step2.columns
columns.foldLeft(step2)((acc, c) => acc.withColumn(c, when(col(c) =!= "null", col(c))))
}