Spark Dataset - aggregation query sums BigDecimal amounts to zero

Time: 2020-06-10 15:01:24

Tags: apache-spark-sql apache-spark-dataset

I have a Dataset of type ExpenseEntry. ExpenseEntry is a basic data structure that tracks the amount spent on each category:

case class ExpenseEntry(
    name: String,
    category: String,
    amount: BigDecimal
)

Sample values:

ExpenseEntry("John", "candy", 0.5)
ExpenseEntry("Tia", "game", 0.25)
ExpenseEntry("John", "candy", 0.15)
ExpenseEntry("Tia", "candy", 0.55)

The expected answer is:

category - name - amount
candy - John - 0.65
candy - Tia - 0.55
game - Tia - 0.25

What I am trying to do is get the total amount spent per category, per name. So I have the following Dataset query:

dataset.groupBy("category", "name").agg(sum("amount"))

In theory, the query looks correct to me. However, the sum is displayed as 0E-18, which is 0. My guess is that the amount is being converted to an int internally by the sum function. How do I cast it to BigInt? Is my understanding of the problem correct?
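A quick way to test that hypothesis is to print the schema Spark infers for the Dataset (a minimal sketch, assuming dataset is the Dataset[ExpenseEntry] described above):

// Spark encodes scala.math.BigDecimal as decimal(38, 18) by default,
// so the amount column is not converted to an integer type.
dataset.printSchema()
// root
//  |-- name: string (nullable = true)
//  |-- category: string (nullable = true)
//  |-- amount: decimal(38,18) (nullable = true)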

2 answers:

Answer 0 (score: 1):

package spark

import org.apache.spark.sql.SparkSession

object SumBig extends App {

  val spark = SparkSession.builder()
    .master("local")
    .appName("DataFrame-example")
    .getOrCreate()

  import spark.implicits._

  case class ExpenseEntry(
    name: String,
    category: String,
    amount: BigDecimal
  )

  val df = Seq(
    ExpenseEntry("John", "candy", 0.5),
    ExpenseEntry("Tia", "game", 0.25),
    ExpenseEntry("John", "candy", 0.15),
    ExpenseEntry("Tia", "candy", 0.55)
  ).toDF()

  df.show(false)

  // Group by category and name, then sum the decimal amounts.
  val r = df.groupBy("category", "name").sum("amount")
  r.show(false)

  //      +--------+----+--------------------+
  //      |category|name|sum(amount)         |
  //      +--------+----+--------------------+
  //      |game    |Tia |0.250000000000000000|
  //      |candy   |John|0.650000000000000000|
  //      |candy   |Tia |0.550000000000000000|
  //      +--------+----+--------------------+

}
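Note that the sums in the output above are correct, so sum is not converting the column to int: scala.math.BigDecimal is encoded as decimal(38,18), and a genuinely zero result simply prints as 0E-18 (BigDecimal zero at scale 18). If every group shows 0E-18, the amounts themselves are most likely arriving as zero upstream. Since the question uses the Dataset API, a typed equivalent of the same aggregation could look like this (a sketch reusing the SparkSession and implicits above; ds is an assumed name):

import org.apache.spark.sql.functions.sum

// Typed variant: group on a (category, name) key and sum the amounts.
val ds = df.as[ExpenseEntry]
ds.groupByKey(e => (e.category, e.name))
  .agg(sum($"amount").as[BigDecimal])
  .show(false)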

Answer 1 (score: 1):

  1. You can use bround() to limit the number of decimal places.
  2. sum does not change the column's data type from decimal to integer.

df.groupBy("category", "name").agg(sum(bround(col("amount"), 2)).as("sum_amount")).show()
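If an explicit cast is preferred over per-row rounding, the aggregated column can also be cast to a narrower decimal type (a sketch; DecimalType(38, 2) is an illustrative precision and scale, not part of the original answer):

import org.apache.spark.sql.functions.{col, sum}
import org.apache.spark.sql.types.DecimalType

// Cast the sum once, instead of rounding every row before aggregating.
df.groupBy("category", "name")
  .agg(sum(col("amount")).cast(DecimalType(38, 2)).as("sum_amount"))
  .show(false)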