Appending a complex column to a Spark DataFrame

Date: 2018-02-09 00:14:25

Tags: scala apache-spark apache-spark-sql

I'm trying to add a column containing a List[Annotation] to a Spark DataFrame using the code below (I've reformatted everything so it can be reproduced by directly copying and pasting).

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Annotation(
  field1: String,
  field2: String,
  field3: Int,
  field4: Float,
  field5: Int,
  field6: List[Mapping]
)

case class Mapping(
  fieldA: String,
  fieldB: String,
  fieldC: String,
  fieldD: String,
  fieldE: String
)

object StructTest {
  def main(args: Array[String]): Unit = {
    val spark               = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    val annotationStruct =
      StructType(
        Array(
          StructField("field1", StringType, nullable = true),
          StructField("field2", StringType, nullable = true),
          StructField("field3", IntegerType, nullable = false),
          StructField("field4", FloatType, nullable = false),
          StructField("field5", IntegerType, nullable = false),
          StructField(
            "field6",
            ArrayType(
              StructType(Array(
                StructField("fieldA", StringType, nullable = true),
                StructField("fieldB", StringType, nullable = true),
                StructField("fieldC", StringType, nullable = true),
                StructField("fieldD", StringType, nullable = true),
                StructField("fieldE", StringType, nullable = true)
              ))),
            nullable = true
          )
        )
      )

    val df = List(1).toDF
    val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
    val schema = df.schema
    val newSchema = schema.add("annotations", ArrayType(annotationStruct), false)
    val rdd = df.rdd.map(x => Row.fromSeq(x.toSeq :+ List(annotation))) // appends the case class instance directly
    val newDF = spark.createDataFrame(rdd, newSchema)
    newDF.printSchema
    newDF.show // the RuntimeException below is thrown here, once the plan actually executes
  }
}

However, I get an error when I run this code:

Caused by: java.lang.RuntimeException: Annotation is not a valid external type for schema of struct<field1:string,field2:string,field3:int,field4:float,field5:int,field6:array<struct<fieldA:string,fieldB:string,fieldC:string,fieldD:string,fieldE:string>>>

It seems the schema I'm passing in to createDataFrame, built with ArrayType(annotationStruct), is malformed in some way, yet it appears to match the schema of a DataFrame containing only a List[Annotation].
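
For what it's worth, here is a minimal sketch of one way to make the snippet above run (the toRow helpers are my own hypothetical additions, not part of the original post): when an explicit schema is passed to createDataFrame, nested structs have to be supplied as Rows rather than as case class instances.

// Hypothetical helpers: convert the case classes into Rows matching
// annotationStruct, since createDataFrame(rdd, schema) expects Row values
// for struct fields, not case class instances.
def mappingToRow(m: Mapping): Row =
  Row(m.fieldA, m.fieldB, m.fieldC, m.fieldD, m.fieldE)

def annotationToRow(a: Annotation): Row =
  Row(a.field1, a.field2, a.field3, a.field4, a.field5, a.field6.map(mappingToRow))

// Rebuild the RDD with nested Rows; the rest of the snippet stays unchanged.
val fixedRdd = df.rdd.map(x => Row.fromSeq(x.toSeq :+ List(annotationToRow(annotation))))
spark.createDataFrame(fixedRdd, newSchema).show(false)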

Edit: an example of modifying a DF schema in this fashion with a simple type rather than a case class.

val df = List(1).toDF
spark.createDataFrame(df.rdd.map(x => Row.fromSeq(x.toSeq :+ "moose")), df.schema.add("moose", StringType, false)).show
+-----+-----+
|value|moose|
+-----+-----+
|    1|moose|
+-----+-----+

Edit 2: I've narrowed this down a bit. Sadly, I don't have the option of creating the DataFrame directly from the case class, which is why I was trying to mirror it as a Struct using ScalaReflection. In this case, I'm not altering a previous schema, just attempting to create a DataFrame from an RDD of Rows that contain a list of my case class. Spark had an issue in 1.6 affecting the parsing of arrays of structs that could be empty or null - I don't know whether these are related.

import org.apache.spark.sql.catalyst.ScalaReflection

val spark            = SparkSession.builder().master("local[*]").getOrCreate()
val annotationSchema = ScalaReflection.schemaFor[Annotation].dataType.asInstanceOf[StructType]
val annotation       = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
val testRDD          = spark.sparkContext.parallelize(List(List(annotation))).map(x => Row(x))
val testSchema       = StructType(
  Array(StructField("annotations", ArrayType(annotationSchema), nullable = false))
)
spark.createDataFrame(testRDD, testSchema).show
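
Hedged in the same way (and again not from the original post): this pared-down test also appears to go through once the case class is converted to a nested Row matching annotationSchema, reusing the hypothetical annotationToRow helper sketched above.

// Wrap each annotation as a Row so the external types line up with testSchema.
val fixedTestRDD = spark.sparkContext
  .parallelize(List(List(annotation)))
  .map(xs => Row(xs.map(annotationToRow)))
spark.createDataFrame(fixedTestRDD, testSchema).show(false)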

1 Answer:

Answer 0: (score: 1)

If your concern is adding a complex column to an existing dataframe, the following solution should work for you.

import spark.implicits._

val df = List(1).toDF
// sc is the SparkContext (spark.sparkContext outside of the shell).
// Note that rdd.zip requires both RDDs to have the same number of
// partitions and the same number of elements in each partition.
val annotation = sc.parallelize(List(Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))))
// Zip the two RDDs row-for-row and merge each pair into a single case class.
val newDF = df.rdd.zip(annotation).map(x => Merged(x._1.get(0).asInstanceOf[Int], x._2)).toDF
newDF.printSchema
newDF.show(false)

which should give you

root
 |-- value: integer (nullable = false)
 |-- annotations: struct (nullable = true)
 |    |-- field1: string (nullable = true)
 |    |-- field2: string (nullable = true)
 |    |-- field3: integer (nullable = false)
 |    |-- field4: float (nullable = false)
 |    |-- field5: integer (nullable = false)
 |    |-- field6: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- fieldA: string (nullable = true)
 |    |    |    |-- fieldB: string (nullable = true)
 |    |    |    |-- fieldC: string (nullable = true)
 |    |    |    |-- fieldD: string (nullable = true)
 |    |    |    |-- fieldE: string (nullable = true)

+-----+---------------------------------------+
|value|annotations                            |
+-----+---------------------------------------+
|1    |[1,2,1,0.5,1,WrappedArray([a,b,c,d,e])]|
+-----+---------------------------------------+

The case classes used are the same as yours, with a Merged case class created in addition.

case class Merged(value : Int, annotations: Annotation)
case class Annotation(field1: String, field2: String, field3: Int, field4: Float, field5: Int, field6: List[Mapping])
case class Mapping(fieldA: String, fieldB: String, fieldC: String, fieldD: String, fieldE: String)

We don't need to define a schema when case classes are used. The process by which column names are created is different with case classes than with sqlContext.createDataFrame.
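
As a hedged illustration of that point (a sketch assuming Spark 2.x with the case classes above in scope, not code from the original answer):

import org.apache.spark.sql.functions.typedLit
import spark.implicits._

// Spark derives the full nested schema from the case class encoder,
// so no StructType has to be written by hand.
val direct = Seq(Merged(1, Annotation("1", "2", 1, .5f, 1,
  List(Mapping("a", "b", "c", "d", "e"))))).toDF
direct.printSchema

// Alternative (typedLit exists from Spark 2.2 on): append the same complex
// value as a literal column, with no RDD round trip at all.
val viaLit = List(1).toDF.withColumn("annotations",
  typedLit(List(Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e"))))))
viaLit.printSchema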