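I have a dataset like the one shown below. With a DataFrame I can easily round to 2 decimal places, but I am wondering whether there is a simpler way to do the same thing with a typed Dataset.
Here is my code snippet: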
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.expressions.scalalang.typed.{sum => typedSum}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DecimalType}
case class Record(BOOK: String,ID: String,CCY: String,AMT: Double)
def getDouble(num: Double) = {BigDecimal(num).setScale(2, BigDecimal.RoundingMode.HALF_UP).toDouble}
// Assumes a SparkSession named spark is in scope (as in spark-shell).
import spark.implicits._
val data = Seq(
("ALBIBC","1950363","USD",2339055.7945),
("ALBIBC","1950363","USD",78264623778.813345),
("ALBIBC","1950363","USD",45439055.222),
("ALBIBC","1950363","EUR",746754759055.343),
("ALBIBC","1950363","EUR",343439055.88780)
).toDF("BOOK","ID","CCY","AMT").as[Record]
The DataFrame approach produces the following output:
val df: DataFrame = data.groupBy('BOOK,'ID,'CCY).agg(sum('AMT).cast(DecimalType(38,2)).as("Balance"))
df.show()
+------+-------+---+---------------+
| BOOK| ID|CCY| Balance|
+------+-------+---+---------------+
|ALBIBC|1950363|USD| 78312401889.83|
|ALBIBC|1950363|EUR|747098198111.23|
+------+-------+---+---------------+
With a typed Dataset, how can I round the balance to two decimal places?
val sumBalance = typedSum[Record](_.AMT).as[Double].name("Balance")
val ds = data.groupByKey(thor => (thor.BOOK, thor.ID, thor.CCY)).agg(sumBalance.name("Balance"))
.map{case(key,value) => (key._1,key._2,key._3,getDouble(value))}
ds.show()
+------+-------+---+------------------+
| _1| _2| _3| _4|
+------+-------+---+------------------+
|ALBIBC|1950363|USD| 7.831240188983E10|
|ALBIBC|1950363|EUR|7.4709819811123E11|
+------+-------+---+------------------+
I can use the DataFrame approach, but I am curious how to do this with a Dataset. Any suggestions are welcome.
Thanks
Answer 0 (score: 0)
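Your mistake is converting back to Double. Floating-point representation cannot represent all possible numbers, and a large Double is printed in scientific notation, which is why the rounded value shows up as 7.831240188983E10.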
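The point about Double can be checked without Spark. The following is a minimal sketch in plain Scala; the object name `DoubleDisplaySketch` and its field names are made up for illustration and do not come from the original post.

```scala
// Hypothetical helper object (not from the original post), plain Scala only.
object DoubleDisplaySketch {
  // The USD balance from the question, already rounded to two places.
  val rounded: Double = 78312401889.83

  // Double.toString switches to scientific notation for magnitudes >= 10^7.
  val asDoubleText: String = rounded.toString

  // scala.math.BigDecimal built from the same Double prints in plain notation.
  val asDecimalText: String = BigDecimal(rounded).toString

  // Binary floating point cannot store most decimal fractions exactly,
  // so this comparison is false.
  val sumIsExact: Boolean = (0.1 + 0.2) == 0.3

  def main(args: Array[String]): Unit = {
    println(asDoubleText)   // 7.831240188983E10
    println(asDecimalText)  // 78312401889.83
    println(sumIsExact)     // false
  }
}
```

So the Double in the question already holds the rounded value; only its textual form differs from what a decimal type would print.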
Redefine (and possibly rename) your function as:
def getDouble(num: Double) = BigDecimal(num).setScale(
2, BigDecimal.RoundingMode.HALF_UP
)
Example:
Seq(7.831240188983E10, 7.4709819811123E11).toDS.map(getDouble).show
// +---------------+
// | value|
// +---------------+
// | 78312401889.83|
// |747098198111.23|
// +---------------+
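Putting this together with the typed aggregation from the question, a full pipeline might look like the sketch below. It is an untested outline under several assumptions: a local SparkSession, Spark's `typed.sum` aggregator (deprecated in Spark 3.x but still available), and a hypothetical result class `BookBalance` plus a renamed helper `roundTo2`, neither of which appears in the original post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.scalalang.typed

// Input row, as defined in the question.
case class Record(BOOK: String, ID: String, CCY: String, AMT: Double)
// Hypothetical output row (an assumption) so the result columns keep readable names.
case class BookBalance(BOOK: String, ID: String, CCY: String, BALANCE: BigDecimal)

object TypedRoundingSketch {
  // Same logic as the answer's function, renamed because it returns a BigDecimal.
  def roundTo2(num: Double): BigDecimal =
    BigDecimal(num).setScale(2, BigDecimal.RoundingMode.HALF_UP)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-rounding-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(
      Record("ALBIBC", "1950363", "USD", 2339055.7945),
      Record("ALBIBC", "1950363", "USD", 78264623778.813345),
      Record("ALBIBC", "1950363", "USD", 45439055.222),
      Record("ALBIBC", "1950363", "EUR", 746754759055.343),
      Record("ALBIBC", "1950363", "EUR", 343439055.88780)
    ).toDS()

    // Typed group-and-sum, then round each total into a BigDecimal column.
    val balances = data
      .groupByKey(r => (r.BOOK, r.ID, r.CCY))
      .agg(typed.sum[Record](_.AMT))
      .map { case ((book, id, ccy), total) =>
        BookBalance(book, id, ccy, roundTo2(total))
      }

    balances.show(truncate = false)
    spark.stop()
  }
}
```

Mapping into a case class instead of a tuple avoids the `_1` ... `_4` column names seen in the question. Depending on the Spark version, `show` may print more than two fractional digits, because the default encoder for `BigDecimal` uses a wider decimal type; the value itself is still rounded to two places.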