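I am trying to add a column containing List[Annotation] to a Spark DataFrame using the following code (I have reformatted everything so it can be reproduced by copying and pasting directly).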
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

case class Annotation(
  field1: String,
  field2: String,
  field3: Int,
  field4: Float,
  field5: Int,
  field6: List[Mapping]
)

case class Mapping(
  fieldA: String,
  fieldB: String,
  fieldC: String,
  fieldD: String,
  fieldE: String
)

object StructTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    val annotationStruct =
      StructType(
        Array(
          StructField("field1", StringType, nullable = true),
          StructField("field2", StringType, nullable = true),
          StructField("field3", IntegerType, nullable = false),
          StructField("field4", FloatType, nullable = false),
          StructField("field5", IntegerType, nullable = false),
          StructField(
            "field6",
            ArrayType(
              StructType(Array(
                StructField("fieldA", StringType, nullable = true),
                StructField("fieldB", StringType, nullable = true),
                StructField("fieldC", StringType, nullable = true),
                StructField("fieldD", StringType, nullable = true),
                StructField("fieldE", StringType, nullable = true)
              ))),
            nullable = true
          )
        )
      )

    val df = List(1).toDF
    val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
    val schema = df.schema
    val newSchema = schema.add("annotations", ArrayType(annotationStruct), false)
    val rdd = df.rdd.map(x => Row.fromSeq(x.toSeq :+ List(annotation)))
    val newDF = spark.createDataFrame(rdd, newSchema)
    newDF.printSchema
    newDF.show
  }
}
However, I get an error when I run this code.
Caused by: java.lang.RuntimeException: Annotation is not a valid external type for schema of struct<field1:string,field2:string,field3:int,field4:float,field5:int,field6:array<struct<fieldA:string,fieldB:string,fieldC:string,fieldD:string,fieldE:string>>>
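The schema I pass in, ArrayType(annotationStruct), appears to be malformed when the DataFrame is created with createDataFrame, yet it seems to match the schema of a DataFrame that contains only a List[Annotation].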
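One likely reading of the error: when a schema is supplied explicitly, Spark expects every struct value in the data to be a Row rather than a case class instance, so an Annotation object does not satisfy a StructType even when the field names and types line up. The sketch below converts the case classes to nested Rows before calling createDataFrame. The helpers mappingToRow and annotationToRow and the object name are hypothetical, introduced only for illustration, and Encoders.product is used here merely as a shorter way to obtain the struct than writing it out by hand.

```scala
import org.apache.spark.sql.{Encoders, Row, SparkSession}
import org.apache.spark.sql.types.ArrayType

case class Mapping(fieldA: String, fieldB: String, fieldC: String, fieldD: String, fieldE: String)
case class Annotation(field1: String, field2: String, field3: Int, field4: Float, field5: Int, field6: List[Mapping])

object RowConversionSketch {
  // Hypothetical helpers: build the Row that a StructType column expects.
  def mappingToRow(m: Mapping): Row =
    Row(m.fieldA, m.fieldB, m.fieldC, m.fieldD, m.fieldE)

  def annotationToRow(a: Annotation): Row =
    Row(a.field1, a.field2, a.field3, a.field4, a.field5, a.field6.map(mappingToRow))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Struct derived from the case class instead of being spelled out manually.
    val annotationStruct = Encoders.product[Annotation].schema
    val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))

    val df = List(1).toDF
    val newSchema = df.schema.add("annotations", ArrayType(annotationStruct), nullable = false)

    // Append a list of Rows (not a list of Annotation objects) to every existing row.
    val rdd = df.rdd.map(x => Row.fromSeq(x.toSeq :+ List(annotationToRow(annotation))))

    val newDF = spark.createDataFrame(rdd, newSchema)
    newDF.printSchema()
    newDF.show(false)
    spark.stop()
  }
}
```

The only change relative to the failing code is the value placed in the row; the schema itself is left as an array of structs.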
Edit: an example of modifying a DataFrame's schema in this way using a simple type instead of a case class.
val df = List(1).toDF
spark.createDataFrame(df.rdd.map(x => Row.fromSeq(x.toSeq :+ "moose")), df.schema.add("moose", StringType, false)).show
+-----+-----+
|value|moose|
+-----+-----+
|    1|moose|
+-----+-----+
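The simple case works because a plain String is already a valid external type for StringType, so no conversion is needed. For a constant column of a simple type, the built-in withColumn and lit functions are a possible alternative to rebuilding the rows by hand. A minimal sketch, where the object name LitColumnSketch and the method addMoose are illustrative only:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object LitColumnSketch {
  // Adds a constant string column without touching the underlying RDD.
  def addMoose(df: DataFrame): DataFrame =
    df.withColumn("moose", lit("moose"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    addMoose(List(1).toDF).show()
    spark.stop()
  }
}
```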
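Edit 2: I have narrowed the problem down a bit further. Unfortunately, I do not have the option of creating the DataFrame directly from the case class, which is why I am trying to mirror it as a Struct using ScalaReflection. In this case I am not altering a previous schema, just trying to create a DataFrame from an RDD of Rows that contain a list of my case class. Spark had an issue in 1.6 that affected parsing arrays of structs that may be empty or null - I do not know whether the two are related.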
import org.apache.spark.sql.catalyst.ScalaReflection

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val annotationSchema = ScalaReflection.schemaFor[Annotation].dataType.asInstanceOf[StructType]
// field4 is a Float, so the literal needs the f suffix to compile.
val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
val testRDD = spark.sparkContext.parallelize(List(List(annotation))).map(x => Row(x))
val testSchema = StructType(
  Array(StructField("annotations", ArrayType(annotationSchema), false)
))
spark.createDataFrame(testRDD, testSchema).show
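When the schema has to come from reflection and the case classes cannot be used to build the DataFrame directly, one possible approach is to convert the objects to Rows generically rather than field by field. The sketch below is an assumption-laden illustration, not a Spark API: toExternal is a hypothetical helper that walks a value and replaces every case class with a Row. It handles only Seq, Option, and Product; maps and arrays are left out.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.{ArrayType, StructField, StructType}

case class Mapping(fieldA: String, fieldB: String, fieldC: String, fieldD: String, fieldE: String)
case class Annotation(field1: String, field2: String, field3: Int, field4: Float, field5: Int, field6: List[Mapping])

object ReflectionRowSketch {
  // Hypothetical helper: recursively replace case classes with Rows.
  // Seq and Option are matched before Product because List and Some are Products too.
  def toExternal(value: Any): Any = value match {
    case s: Seq[_]  => s.map(toExternal)
    case Some(v)    => toExternal(v)
    case None       => null
    case p: Product => Row.fromSeq(p.productIterator.map(toExternal).toList)
    case other      => other
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val annotationSchema = ScalaReflection.schemaFor[Annotation].dataType.asInstanceOf[StructType]
    val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))

    // Each record becomes a Row holding a list of Rows instead of a list of Annotation objects.
    val testRDD = spark.sparkContext
      .parallelize(List(List(annotation)))
      .map(x => Row(toExternal(x)))

    val testSchema = StructType(
      Array(StructField("annotations", ArrayType(annotationSchema), nullable = false)))

    spark.createDataFrame(testRDD, testSchema).show(false)
    spark.stop()
  }
}
```

The order of the match cases matters: putting Product first would turn every List into a Row, because the list cons cell is itself a case class.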
Answer 0 (score: 1)
If your concern is adding a complex column to an existing DataFrame, the following solution should work for you.
val df = List(1).toDF
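// zip requires both RDDs to have the same number of partitions, so match df.rdd explicitly.
val annotation = sc.parallelize(List(Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))), df.rdd.getNumPartitions)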
val newDF = df.rdd.zip(annotation).map(x => Merged(x._1.get(0).asInstanceOf[Int], x._2)).toDF
newDF.printSchema
newDF.show(false)
This should give you
root
 |-- value: integer (nullable = false)
 |-- annotations: struct (nullable = true)
 |    |-- field1: string (nullable = true)
 |    |-- field2: string (nullable = true)
 |    |-- field3: integer (nullable = false)
 |    |-- field4: float (nullable = false)
 |    |-- field5: integer (nullable = false)
 |    |-- field6: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- fieldA: string (nullable = true)
 |    |    |    |-- fieldB: string (nullable = true)
 |    |    |    |-- fieldC: string (nullable = true)
 |    |    |    |-- fieldD: string (nullable = true)
 |    |    |    |-- fieldE: string (nullable = true)

+-----+---------------------------------------+
|value|annotations                            |
+-----+---------------------------------------+
|1    |[1,2,1,0.5,1,WrappedArray([a,b,c,d,e])]|
+-----+---------------------------------------+
The case classes used are the same as yours, with an additional Merged case class.
case class Merged(value : Int, annotations: Annotation)
case class Annotation(field1: String, field2: String, field3: Int, field4: Float, field5: Int, field6: List[Mapping])
case class Mapping(fieldA: String, fieldB: String, fieldC: String, fieldD: String, fieldE: String)
When using case classes, we do not need to define a schema. Column names are derived differently when using case classes than when using sqlContext.createDataFrame.
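The answer above produces a single struct column, whereas the question asks for a list of annotations. If building from case classes is an option, the same idea can be sketched with a sequence field so that the inferred column is an array of structs. MergedList and CaseClassSchemaSketch are hypothetical names introduced for this illustration, and Seq is used for the new field as an assumption about what encodes most portably across Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

case class Mapping(fieldA: String, fieldB: String, fieldC: String, fieldD: String, fieldE: String)
case class Annotation(field1: String, field2: String, field3: Int, field4: Float, field5: Int, field6: List[Mapping])
// Hypothetical variant of Merged that holds several annotations per row.
case class MergedList(value: Int, annotations: Seq[Annotation])

object CaseClassSchemaSketch {
  // The schema is inferred from MergedList; no StructType is written by hand.
  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val annotation = Annotation("1", "2", 1, .5f, 1, List(Mapping("a", "b", "c", "d", "e")))
    List(1).toDS().map(v => MergedList(v, Seq(annotation))).toDF()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val newDF = build(spark)
    newDF.printSchema()
    newDF.show(false)
    spark.stop()
  }
}
```

Because the rows never pass through an untyped Row here, there is no external-type check to trip over; the trade-off is that the case classes must be available at compile time.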