When I map over a Dataset, I keep running into the problem that my columns get renamed from _1, _2 to value, value.
What causes the renaming?
Answer 0 (score: 0)
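Because calling map on a Dataset causes the data in that query to be serialized and deserialized in Spark.
To serialize it, Spark has to know the Encoder for the result type. That is why there is an object ExpressionEncoder with an apply method. Its JavaDoc says: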
A factory for constructing encoders that convert objects and primitives to and from the
internal row format using catalyst expressions and code generation. By default, the
expressions used to retrieve values from an input row when producing an object will be created as
follows:
- Classes will have their sub fields extracted by name using [[UnresolvedAttribute]] expressions
and [[UnresolvedExtractValue]] expressions.
- Tuples will have their subfields extracted by position using [[BoundReference]] expressions.
- Primitives will have their values extracted from the first ordinal with a schema that defaults
to the name `value`.
Look at the last point. Your query maps to a primitive only, so Catalyst uses the name "value" for the resulting column.
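The renaming can be reproduced with a minimal sketch like the one below. It assumes a local SparkSession and a small tuple Dataset; the object name ValueColumnDemo and the sample data are made up for illustration and are not taken from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical demo object; the name and the data are illustrative only.
object ValueColumnDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-column-demo")
      .getOrCreate()
    import spark.implicits._

    // A Dataset of tuples: the tuple encoder names the columns _1 and _2.
    val pairs = Seq((1, "a"), (2, "b")).toDS()
    println(pairs.columns.mkString(", "))   // _1, _2

    // Mapping to a primitive: the Int encoder produces one column named "value".
    val firsts = pairs.map(_._1)
    println(firsts.columns.mkString(", "))  // value

    spark.stop()
  }
}
```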
If you add .select('value.as("MyPropertyName")).as[CaseClass], the field name will be correct.
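As a rough sketch of that fix, assuming a local SparkSession: the case class MyRecord and the object RenameValueColumn are hypothetical names chosen for this example, and $"value" is used in place of the symbol literal 'value, which is deprecated in newer Scala versions.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical names, for illustration only.
object RenameValueColumn {
  // The field name must match the alias given in select below.
  case class MyRecord(MyPropertyName: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rename-value-column")
      .getOrCreate()
    import spark.implicits._

    // Mapping to a primitive yields a single column named "value".
    val firsts = Seq((1, "a"), (2, "b")).toDS().map(_._1)

    // Alias the column, then convert to a typed Dataset[MyRecord].
    val renamed = firsts
      .select($"value".as("MyPropertyName"))
      .as[MyRecord]

    println(renamed.columns.mkString(", "))  // MyPropertyName

    spark.stop()
  }
}
```

The case class is declared at object level rather than inside main so that Spark can derive an encoder for it through spark.implicits.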
Types that get the column name "value":