Is there a way to retain the order of variables in a Spark Dataset?

Asked: 2017-05-24 22:22:49

Tags: apache-spark spark-dataframe apache-spark-dataset

I am creating a Spark Dataset like this:

Dataset<myBeanClass> myDataset = myDataFrame.as(Encoders.bean(myBeanClass.class));

At this point, its schema looks like this:

 root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- gender: string (nullable = true)
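
For context, here is a minimal sketch of what myBeanClass might look like, with the field names and types inferred from the schema above (the actual class is not shown in the question):

public class myBeanClass implements java.io.Serializable {
    // Fields inferred from the printed schema: name, age, gender, all strings
    private String name;
    private String age;
    private String gender;

    // JavaBean getters/setters, which Encoders.bean uses to derive the schema
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public String getAge() { return age; }
    public void setAge(String age) { this.age = age; }

    public String getGender() { return gender; }
    public void setGender(String gender) { this.gender = gender; }
}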

After applying a map transformation,

Dataset<myBeanClass> resultDataset = myDataset.map(new MapFunction<myBeanClass,myBeanClass>() {
    @Override
    public myBeanClass call(myBeanClass v1) throws Exception {

        // some code
        return v1;
    }

}, Encoders.bean(myBeanClass.class));

the schema becomes:

 root
 |-- age: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)

The same behavior can also be seen in this example. Is there a way to retain the order?

1 Answer:

Answer 0 (score: 0)

I could not find a way to stop the order of the variables in the schema from changing, but I was able to convert it back to whatever order I wanted. This is how I did it:

DataFrame resultsDataFrame = myDataset.toDF().selectExpr(myDataFrame.schema().fieldNames());

The schema of resultsDataFrame is the same as the schema of the DataFrame that I created the Dataset from:

root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- gender: string (nullable = true)
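
Applied to the Dataset whose column order actually changed in the question (resultDataset), the same workaround would look roughly like this; Dataset<Row> is used here as the Java equivalent of DataFrame in Spark 2.x, and resultDataset, myDataFrame, and myBeanClass are the names assumed from the question:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;

// Select the columns of the mapped result in the original DataFrame's
// column order, restoring name, age, gender.
Dataset<Row> reorderedDataFrame = resultDataset.toDF()
        .selectExpr(myDataFrame.schema().fieldNames());

// Optionally re-attach the bean encoder to get a typed Dataset back;
// as() keeps the reordered schema.
Dataset<myBeanClass> reorderedDataset =
        reorderedDataFrame.as(Encoders.bean(myBeanClass.class));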