Duplicate field in the inferred schema when loading XML with Spark

Time: 2018-06-14 19:54:45

Tags: xml scala apache-spark dataframe

I want to create a schema with this structure:

|    |-- Features: struct (nullable = true)
|    |    |-- Feature: array (nullable = true)
|    |    |    |-- element: string (containsNull = true)

This is my code:

StructField( "Features", StructType(
        Array(
          StructField( "Feature", ArrayType(
            StructType(
              Array(
                StructField( "element", StringType, true )
              )
            )
          ) )
        )
      ), true )

The result, which has an extra nested element level:

|    |-- Features: struct (nullable = true)
|    |    |-- Feature: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- element: string (nullable = true)

Any ideas?

1 answer:

Answer 0 (score: 1)

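You should omit the innermost struct. The name element is not a field you declare; it is the label printSchema uses for an array's element type. Wrapping a StructField named "element" in a StructType therefore produces an array of structs, each containing a string field called element. Pass StringType directly to ArrayType instead: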

import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(Seq(StructField("Features", StructType(Seq(
  StructField("Feature", ArrayType(StringType))
)))))

spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema).printSchema
// root
//  |-- Features: struct (nullable = true)
//  |    |-- Feature: array (nullable = true)
//  |    |    |-- element: string (containsNull = true)
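
As a cross-check, the same shape can also be written as a DDL string. This is only a sketch, assuming Spark 2.2 or later (where StructType.fromDDL is available); the value names programmatic and fromDdl are illustrative and not part of the original question. It uses treeString, which returns the text printSchema would print, so no SparkSession or DataFrame is required.

```scala
import org.apache.spark.sql.types._

// Programmatic form, as in the answer above: the element type goes straight into ArrayType.
val programmatic = StructType(Seq(
  StructField("Features", StructType(Seq(
    StructField("Feature", ArrayType(StringType))
  )))
))

// The same shape expressed as a DDL string (assumption: Spark 2.2+).
val fromDdl = StructType.fromDDL("Features STRUCT<Feature: ARRAY<STRING>>")

// treeString renders the schema tree without building a DataFrame.
println(programmatic.treeString)
println(fromDdl.treeString)
```

If both trees show element: string directly under Feature, the array holds plain strings rather than structs.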