How can I prevent a Dataset from renaming columns to value when mapping?

Asked: 2017-08-31 13:13:23

Tags: apache-spark apache-spark-dataset

When mapping a Dataset, I keep running into the problem that columns are renamed from _1, _2, etc. to value, value.

What causes the renaming?

1 Answer:

Answer 0: (score: 0)

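Because map on a Dataset causes the query results to be serialized and deserialized in Spark.

To serialize them, Spark has to know the encoder. That is why there is an object ExpressionEncoder with an apply method. Its JavaDoc says: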

 A factory for constructing encoders that convert objects and primitives to and from the
  internal row format using catalyst expressions and code generation.  By default, the
  expressions used to retrieve values from an input row when producing an object will be created as
  follows:
   - Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions
     and [[UnresolvedExtractValue]] expressions.
   - Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
   - Primitives will have their values extracted from the first ordinal with a schema that defaults
     to the name `value`.

Look at the last point. Your query just maps to primitives, so Catalyst uses the name "value".

If you add .select('value.as("MyPropertyName")).as[CaseClass], the field names will be correct.
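
The rename and the fix above can be sketched as follows. This is a minimal illustration, assuming Spark 2.x or later running with a local SparkSession; the case class `Name`, the object `ValueColumnDemo`, and the sample data are hypothetical and not from the question. It uses `$"value"` rather than the symbol syntax `'value`, since symbol literals are deprecated in newer Scala versions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class, used only for this illustration.
// It is defined at top level so that Spark can derive an encoder for it.
case class Name(myPropertyName: String)

object ValueColumnDemo {
  // Maps a Dataset of tuples to a Dataset[String] and returns its column names.
  def mappedColumns(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val pairs = Seq((1, "a"), (2, "b")).toDS() // columns: _1, _2
    val mapped = pairs.map(_._2)               // Dataset[String]
    mapped.columns.toSeq
  }

  // Same mapping, followed by an explicit rename and a cast to the case class.
  def renamedColumns(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val pairs = Seq((1, "a"), (2, "b")).toDS()
    val renamed = pairs
      .map(_._2)
      .select($"value".as("myPropertyName"))   // rename the default column
      .as[Name]                                // back to a typed Dataset
    renamed.columns.toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-column-demo")
      .getOrCreate()
    println(mappedColumns(spark).mkString(", "))
    println(renamedColumns(spark).mkString(", "))
    spark.stop()
  }
}
```

Mapping directly to a case class instead of to a String (for example `pairs.map(p => Name(p._2))`) should also avoid the default name, because case-class fields are encoded by name.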

Types that get the column name "value":

  • Option(_)
  • Array
  • Collection types such as Seq, Map
  • Types such as String, Timestamp, Date, BigDecimal
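
To check this for a few of these types, one can inspect the column names of small Datasets. The sketch below assumes Spark 2.x or later with the encoders from `spark.implicits._`; the object name `ValueSchemaDemo` and the sample values are hypothetical. It covers only String, Int, Seq[Int] and Array[Int], not every type in the list above.

```scala
import org.apache.spark.sql.SparkSession

object ValueSchemaDemo {
  // Returns the column names Spark assigns to Datasets of a few
  // types that are not case classes or tuples.
  def columnNames(spark: SparkSession): Map[String, Seq[String]] = {
    import spark.implicits._
    Map(
      "String"     -> Seq("a", "b").toDS().columns.toSeq,
      "Int"        -> Seq(1, 2).toDS().columns.toSeq,
      "Seq[Int]"   -> Seq(Seq(1, 2)).toDS().columns.toSeq,
      "Array[Int]" -> Seq(Array(1, 2)).toDS().columns.toSeq
    )
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-schema-demo")
      .getOrCreate()
    // Print each type together with the column names of its Dataset.
    columnNames(spark).foreach { case (tpe, cols) =>
      println(s"$tpe: ${cols.mkString(", ")}")
    }
    spark.stop()
  }
}
```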