Get the minimum and maximum from a specific column of a Scala Spark DataFrame

Asked: 2017-04-05 13:15:55

Tags: scala apache-spark dataframe max

I want to access the minimum and maximum values of a specific column in my DataFrame, but I don't have the column's header, only its number. How should I do this in Scala?

Something like this, perhaps:

val q = nextInt(ncol) //we pick a random value for a column number
col = df(q)
val minimum = col.min()

Sorry if this sounds like a silly question, but I couldn't find any information about it :/

6 answers:

Answer 0 (score: 21)

How about getting the column name from the metadata:

import org.apache.spark.sql.functions.{min, max}

val selectedColumnName = df.columns(q) // pull the (q + 1)-th column name from the columns array
df.agg(min(selectedColumnName), max(selectedColumnName))
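
If you want the actual values rather than a one-row DataFrame, a minimal follow-up sketch (assuming the selected column holds doubles) is to collect the single result row:

// a sketch, assuming the selected column is of type Double
val resultRow = df.agg(min(selectedColumnName), max(selectedColumnName)).head()
val minValue = resultRow.getDouble(0)
val maxValue = resultRow.getDouble(1)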

Answer 1 (score: 19)

You can use pattern matching when assigning the variables:

import org.apache.spark.sql.functions.{min, max}
import org.apache.spark.sql.Row

val Row(minValue: Double, maxValue: Double) = df.agg(min(q), max(q)).head

where q is either a Column or a column name (String). This assumes your data type is Double.
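
If the column might not be Double (say it is an integer column), one option, shown here as a sketch rather than part of the original answer, is to cast it to double before aggregating so the pattern match on Double still succeeds (this assumes q is a column name, i.e. a String):

import org.apache.spark.sql.functions.col

// cast to double first so the Row pattern match binds Doubles regardless of the numeric type
val Row(minValue: Double, maxValue: Double) =
  df.agg(min(col(q).cast("double")), max(col(q).cast("double"))).head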

Answer 2 (score: 6)

You can first extract the column name from the column number (by indexing df.columns), then aggregate using the column name:

val df = Seq((2.0, 2.1), (1.2, 1.4)).toDF("A", "B")
// df: org.apache.spark.sql.DataFrame = [A: double, B: double]

df.agg(max(df(df.columns(1))), min(df(df.columns(1)))).show
+------+------+
|max(B)|min(B)|
+------+------+
|   2.1|   1.4|
+------+------+

Answer 3 (score: 3)

Here is a direct way to get the min and max from a DataFrame using column names:

val df = Seq((1, 2), (3, 4), (5, 6)).toDF("A", "B")

df.show()
/*
+---+---+
|  A|  B|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
+---+---+
*/

df.agg(min("A"), max("A")).show()
/*
+------+------+
|min(A)|max(A)|
+------+------+
|     1|     5|
+------+------+
*/

If you want the min and max as separate variables, you can convert the result of the agg() call above into a Row and use Row.getInt(index) to retrieve the values:

val min_max = df.agg(min("A"), max("A")).head()
// min_max: org.apache.spark.sql.Row = [1,5]

val col_min = min_max.getInt(0)
// col_min: Int = 1

val col_max = min_max.getInt(1)
// col_max: Int = 5
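
As a small aside (an addition, not from the original answer), Row.getAs[T](index) is a typed alternative to getInt if you prefer to state the expected type explicitly:

// equivalent extraction using getAs, assuming column "A" holds integers
val colMin = min_max.getAs[Int](0)
val colMax = min_max.getAs[Int](1)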

Answer 4 (score: 0)

In Java, we have to explicitly reference org.apache.spark.sql.functions, which provides the min and max implementations:

datasetFreq.agg(functions.min("Frequency"), functions.max("Frequency")).show();

Answer 5 (score: 0)

Hope this helps.

import org.apache.spark.sql.functions.{min, max, sum}
import spark.implicits._

val sales = sc.parallelize(List(
   ("West",  "Apple",  2.0, 10),
   ("West",  "Apple",  3.0, 15),
   ("West",  "Orange", 5.0, 15),
   ("South", "Orange", 3.0, 9),
   ("South", "Orange", 6.0, 18),
   ("East",  "Milk",   5.0, 5)))

val salesDf = sales.toDF("store", "product", "amount", "quantity")

// register the DataFrame as a temp table so it can be queried with SQL
salesDf.registerTempTable("sales")

val result = spark.sql("SELECT store, product, SUM(amount), MIN(amount), MAX(amount), SUM(quantity) FROM sales GROUP BY store, product")

// OR

salesDf.groupBy("store", "product").agg(min("amount"), max("amount"), sum("amount"), sum("quantity")).show

// output
    +-----+-------+-----------+-----------+-----------+-------------+
    |store|product|min(amount)|max(amount)|sum(amount)|sum(quantity)|
    +-----+-------+-----------+-----------+-----------+-------------+
    |South| Orange|        3.0|        6.0|        9.0|           27|
    | West| Orange|        5.0|        5.0|        5.0|           15|
    | East|   Milk|        5.0|        5.0|        5.0|            5|
    | West|  Apple|        2.0|        3.0|        5.0|           25|
    +-----+-------+-----------+-----------+-----------+-------------+
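
If you then need those per-group minima and maxima back on the driver (an assumed follow-up, not part of the original answer, and only sensible when the number of groups is small), you can collect the aggregated DataFrame:

// a sketch: collect per-(store, product) min/max amounts into a local Map
val perGroup = salesDf.groupBy("store", "product")
  .agg(min("amount").as("min_amount"), max("amount").as("max_amount"))
  .collect()
  .map(r => ((r.getString(0), r.getString(1)), (r.getDouble(2), r.getDouble(3))))
  .toMap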