Question

I want to do something like this:

df
.withColumn("newCol", <some formula>)
.filter(s"""newCol > ${(math.min(max("newCol").asInstanceOf[Double],10))}""")

The exception I get:

org.apache.spark.sql.Column cannot be cast to java.lang.Double

Can you tell me a way to achieve what I want?

Answer 1

I assume newCol already exists in df; then:

import org.apache.spark.sql.expressions.Window   
import org.apache.spark.sql.functions._

df
.withColumn("max_newCol",max($"newCol").over(Window.partitionBy()))
.filter($"newCol"> least($"max_newCol",lit(10.0)))

Instead of max($"newCol").over(Window.partitionBy())

you can also just write max($"newCol").over()
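
To make the window approach concrete, here is a minimal runnable sketch. The local SparkSession, the object name WindowMaxSketch, and the sample values are assumptions made for illustration; only the withColumn/filter chain comes from the answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{least, lit, max}

object WindowMaxSketch {
  def run(): Seq[Double] = {
    // Local session and sample data are assumptions for this sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("window-max-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 5.0, 12.0, 20.0).toDF("newCol")

    // max over an empty window is the global maximum, repeated on every row.
    val filtered = df
      .withColumn("max_newCol", max($"newCol").over())
      // least() compares two Columns, so no value has to leave the executors.
      .filter($"newCol" > least($"max_newCol", lit(10.0)))
      .drop("max_newCol")

    // Here the threshold is least(20.0, 10.0) = 10.0, so 12.0 and 20.0 remain.
    filtered.as[Double].collect().sorted.toSeq
  }
}
```

Note that a window without partitioning moves all rows into a single partition, so this is probably best suited to small or moderately sized data.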

Answer 2

The solution has two parts.

Part 1
Find the maximum value: df.select(max($"col1")).first()(0)

Part 2
Use that value to filter:
df.filter($"col1" === df.select(max($"col1")).first()(0)).show

Bonus
To avoid potential errors, you can also retrieve the maximum in the specific type you need by using the .get family of methods: df.select(max($"col1")).first.getDouble(0)
In this case col1 is DoubleType, so I chose to read it in the matching type. You can get almost any other type. The options are:
getBoolean, getClass, getDecimal, getFloat, getJavaMap, getLong, getSeq, getString, getTimestamp, getAs, getByte, getDate, getDouble, getInt, getList, getMap, getShort, getStruct, getValuesMap
The complete solution in this case:
df.filter($"col1" === df.select(max($"col1")).first.getDouble(0)).show
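
The same two-step idea can be applied to the threshold from the question (the smaller of the column maximum and 10). This is a sketch under assumptions: the local session, the object name CollectMaxSketch, and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object CollectMaxSketch {
  def run(): Seq[Double] = {
    // Local session and sample data are assumptions for this sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("collect-max-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 5.0, 12.0, 20.0).toDF("col1")

    // Step 1: run the aggregate and bring the result to the driver as a Double.
    val maxValue: Double = df.select(max($"col1")).first.getDouble(0)

    // Plain Scala arithmetic works now, because maxValue is no longer a Column.
    val threshold = math.min(maxValue, 10.0)

    // Step 2: filter against the driver-side value.
    df.filter($"col1" > threshold).as[Double].collect().sorted.toSeq
  }
}
```

On an empty DataFrame the aggregate is null and getDouble(0) would throw, so checking isNullAt(0) first may be worthwhile.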

Answer 3

I think the DataFrame describe function is exactly what you are looking for.

ds.describe("age", "height").show()

// output:  
// summary age   height  
// count   10.0  10.0  
// mean    53.3  178.05  
// stddev  11.6  15.7  
// min     18.0  163.0  
// max     92.0  192.0
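
describe returns its statistics as strings, so using the maximum as a filter threshold takes one extra parsing step. A possible sketch follows; the local session, the object name DescribeMaxSketch, and the sample data are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object DescribeMaxSketch {
  def run(): Seq[Double] = {
    // Local session and sample data are assumptions for this sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("describe-max-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 5.0, 12.0, 20.0).toDF("newCol")

    // describe() yields a "summary" column plus one string column per input column.
    val maxValue = df.describe("newCol")
      .filter($"summary" === "max")
      .select("newCol")
      .first
      .getString(0)
      .toDouble

    // Apply the threshold from the question: the smaller of the maximum and 10.
    val threshold = math.min(maxValue, 10.0)
    df.filter($"newCol" > threshold).as[Double].collect().sorted.toSeq
  }
}
```

Since describe also computes count, mean, stddev, and min, a plain max aggregate is likely cheaper when only the maximum is needed.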

Answer 4

I would separate the two steps:

val newDf = df
 .withColumn("newCol", <some formula>)

// Spark 2.1 or later
// With 1.x use join
newDf.alias("l").crossJoin(
  newDf.select(max($"newCol").alias("maxNewCol")).alias("r"))
  .where($"l.newCol" > least($"r.maxNewCol", lit(10.0)))

or

newDf.where(
  $"newCol" > (newDf.select(max($"newCol")).as[Double].first min 10.0))
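
A runnable sketch of the cross-join variant, written so that the right-hand side is a one-row aggregate: the join then adds a single column to each row instead of multiplying rows. The local session, the object name CrossJoinMaxSketch, the column name maxNewCol, and the sample data are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{least, lit, max}

object CrossJoinMaxSketch {
  def run(): Seq[Double] = {
    // Local session and sample data are assumptions for this sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("crossjoin-max-sketch")
      .getOrCreate()
    import spark.implicits._

    val newDf = Seq(1.0, 5.0, 12.0, 20.0).toDF("newCol")

    // One-row DataFrame holding the global maximum.
    val maxDf = newDf.select(max($"newCol").alias("maxNewCol"))

    // Cross join attaches that single value to every row; then filter and drop it.
    val filtered = newDf.crossJoin(maxDf)
      .where($"newCol" > least($"maxNewCol", lit(10.0)))
      .select("newCol")

    filtered.as[Double].collect().sorted.toSeq
  }
}
```

Unlike the driver-side variant, this keeps the whole computation in a single query plan, at the cost of a join.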

Filter a Spark DataFrame by the maximum value of a column

4 answers: