How can I use BigInts in Datasets?

Time: 2017-04-17 16:01:40

Tags: scala apache-spark apache-spark-dataset

Try as I might, I cannot create a Dataset of a case class with enough precision to handle DecimalType(38,0).

I tried:

case class BigId(id: scala.math.BigInt)

This runs into a known ExpressionEncoder bug: https://issues.apache.org/jira/browse/SPARK-20341
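For context, the attempt looked roughly like the following minimal sketch; the SparkSession name `spark`, the single-element Seq, and the 38-digit sample id are assumptions for illustration, not taken from the original report.

    import spark.implicits._

    case class BigId(id: scala.math.BigInt)

    // Materialising a value wider than 19 digits through the derived encoder
    // is where the ExpressionEncoder problem tracked in SPARK-20341 shows up.
    val ds = Seq(BigId(BigInt("12345678901234567890123456789012345678"))).toDS()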

I also tried:

case class BigId(id: java.math.BigDecimal)

But this runs into an error because the only supported precision is DecimalType(38,18). I even created my own custom encoder, borrowing heavily from the Spark source code. The biggest change is that I default the schema for java.math.BigDecimal to DecimalType(38,0); I could find no reason to change the serializer or deserializer. When I provide my custom encoder to Dataset.as or Dataset.map (a sketch of the call pattern follows the trace below), I get the following stack trace:

User class threw exception: org.apache.spark.sql.AnalysisException: Cannot up cast `id` from decimal(38,0) to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "java.math.BigDecimal", name: "id")
- root class: "BigId"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
org.apache.spark.sql.AnalysisException: Cannot up cast `id` from decimal(38,0) to decimal(38,18) as it may truncate
The type path of the target object is:
- field (class: "java.math.BigDecimal", name: "id")
- root class: "BigId"
You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:1998)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2020)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34$$anonfun$applyOrElse$14.applyOrElse(Analyzer.scala:2015)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:286)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:328)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5$$anonfun$apply$11.apply(TreeNode.scala:357)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:355)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:326)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionDown$1(QueryPlan.scala:235)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:245)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$7.apply(QueryPlan.scala:254)
    at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:186)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:254)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:223)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2015)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$34.applyOrElse(Analyzer.scala:2011)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:2011)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.apply(Analyzer.scala:1996)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
    at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:124)
    at scala.collection.immutable.List.foldLeft(List.scala:84)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
    at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.resolveAndBind(ExpressionEncoder.scala:244)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:210)
    at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
    at org.apache.spark.sql.Dataset$.apply(Dataset.scala:59)
    at org.apache.spark.sql.Dataset.as(Dataset.scala:359)
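The call pattern that produces the trace above is sketched below; `bigIdEncoder` is a placeholder name for the hand-rolled ExpressionEncoder described earlier, not a name from the original code.

    import org.apache.spark.sql.{DataFrame, Dataset, Encoder}

    // Both Dataset.as and Dataset.map resolve the supplied encoder against the
    // input schema during analysis; that resolution is where the up-cast from
    // decimal(38,0) to decimal(38,18) is rejected.
    def toBigIds(df: DataFrame)(implicit bigIdEncoder: Encoder[BigId]): Dataset[BigId] =
      df.as[BigId]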

I can confirm that both my input DataFrame.schema and my encoder.schema report a precision of DecimalType(38,0). I have also removed all import spark.implicits._ statements to confirm that the DataFrame methods are using my custom encoder.
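That check can be done with something like the following (again, `df` and `bigIdEncoder` are assumed names):

    // Both lines should report DecimalType(38,0) for the id field.
    println(df.schema("id").dataType)
    println(bigIdEncoder.schema("id").dataType)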

At this point, the simplest remaining option seems to be passing the id around as a String. That seems wasteful.
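For reference, the String fallback would look something like this minimal sketch; `spark` and `df` are assumed names, with `df` holding a single decimal(38,0) column named id.

    import spark.implicits._

    case class BigIdAsString(id: String)

    // Cast the decimal column to a string before asking for the typed view.
    val asStrings = df.select($"id".cast("string").as("id")).as[BigIdAsString]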

1 Answer:

Answer 0 (score: 0):

While I admire your initiative in defining a custom encoder, it is unnecessary. Your value is an ID, not something you intend to use as a number. In other words, you will not be doing calculations with it. You would simply be converting a String into a BigId for the sole purpose of a perceived optimization.
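To illustrate the point, here is a minimal sketch (the record shape and sample values are invented for the example): keeping the id as a String still supports everything an identifier is actually used for, such as grouping, joining, or equality checks.

    import spark.implicits._

    case class Record(id: String)

    val records = Seq(
      Record("12345678901234567890123456789012345678"),
      Record("12345678901234567890123456789012345678")
    ).toDS()

    // Grouping (and likewise joining) works directly on the String id; no
    // numeric representation is ever needed.
    val countsById = records.groupByKey(_.id).count()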

As the legendary Donald Knuth once wrote: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."

Solve the efficiency problem when it actually occurs. Right now you have a solution in search of a problem, and after considerable time spent you do not even have a working one; that time would have been better spent on the quality of your analysis.

As for the efficiency of using a String in general, rely on the Tungsten optimizations the Spark team works so hard on under the hood, and keep your eye on the ball.