How to cast all columns of a DataFrame (with nested StructTypes) to string in Spark

Time: 2018-07-25 09:53:25

Tags: scala apache-spark apache-spark-sql bigdata

For some reason, I am trying to cast all fields of a DataFrame (with nested StructTypes) to String.

I have already seen some solutions on StackOverflow, but they only work for simple DataFrames without nested structs (for example, "how to cast all columns of dataframe to string").

I will illustrate what I actually need with an example:

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._
    import org.apache.spark.sql.functions._
    import spark.implicits._
    val rows1 = Seq(
      Row(1, Row("a", "b"), 8.00, Row(1, 2)),
      Row(2, Row("c", "d"), 9.00, Row(3, 4))
    )

    val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)

    val schema1 = StructType(
      Seq(
        StructField("id", IntegerType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", DoubleType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", IntegerType, true),
            StructField("v", IntegerType, true)
          )
        ), true)
      )
    )

    val df1 = spark.createDataFrame(rows1Rdd, schema1)

    println("Schema with nested struct")
    df1.printSchema()

If we print the schema of the created DataFrame, we get the following result:

    root
     |-- id: integer (nullable = true)
     |-- s1: struct (nullable = true)
     |    |-- x: string (nullable = true)
     |    |-- y: string (nullable = true)
     |-- d: double (nullable = true)
     |-- s2: struct (nullable = true)
     |    |-- u: integer (nullable = true)
     |    |-- v: integer (nullable = true)

I tried to cast all values to string as follows:

    df1.select(df1.columns.map(c => col(c).cast(StringType)) : _*)

But this converts each nested struct as a whole to a string, instead of casting each value inside it to String:

    root
     |-- id: string (nullable = true)
     |-- s1: string (nullable = true)
     |-- d: string (nullable = true)
     |-- s2: string (nullable = true)

Is there a simple solution that would let me cast all values to StringType? This is the StructType I want as the DataFrame schema after the cast:

    root
     |-- id: string (nullable = true)
     |-- s1: struct (nullable = true)
     |    |-- x: string (nullable = true)
     |    |-- y: string (nullable = true)
     |-- d: string (nullable = true)
     |-- s2: struct (nullable = true)
     |    |-- u: string (nullable = true)
     |    |-- v: string (nullable = true)

Thank you very much!

3 Answers:

Answer 0 (score: 2)

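You can create separate SQL expressions for the simple-type columns and for the struct-type columns.

This solution is not very generic, but it works as long as structs are the only complex column types you have. The code handles any number of fields under a struct, not just two.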

    // Build one cast expression per struct column, turning every field into a string
    val structCastExpression = df1.schema
      .filter(_.dataType.isInstanceOf[StructType])
      .map(c => (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
      .map { case (col, sub) =>
        s"""cast(${col} as struct${sub.map { c => s"$c:string" }.mkString("<", ",", ">")} ) as $col"""
      }
    //List(cast(s1 as struct<x:string,y:string> ) as s1,
    //     cast(s2 as struct<u:string,v:string> ) as s2)

    // Build one cast expression per non-struct column
    val otherColumns = df1.schema
      .filterNot(_.dataType.isInstanceOf[StructType])
      .map(c => s""" cast(${c.name} as string) as ${c.name} """)
    //List(" cast(id as string) as id ", " cast(d as string) as d ")

    // Original column order
    val originalColumns = df1.columns

    // Concatenate both lists of expressions into one
    val finalExpression = otherColumns ++ structCastExpression
    // List(" cast(id as string) as id ",
    //      " cast(d as string) as d ",
    //      cast(s1 as struct<x:string,y:string> ) as s1,
    //      cast(s2 as struct<u:string,v:string> ) as s2 )

    // Use `selectExpr` to apply the expressions, then restore the original column order
    df1.selectExpr(finalExpression : _*)
      .select(originalColumns.head, originalColumns.tail: _*)
      .printSchema

    //root
    // |-- id: string (nullable = true)
    // |-- s1: struct (nullable = true)
    // |    |-- x: string (nullable = true)
    // |    |-- y: string (nullable = true)
    // |-- d: string (nullable = true)
    // |-- s2: struct (nullable = true)
    // |    |-- u: string (nullable = true)
    // |    |-- v: string (nullable = true)

Answer 1 (score: 0)

You can use a udf and a custom case class as follows:

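    // Target type for the s2 struct, with both fields as String
    case class s2(u: String, v: String)

    // Convert the (u, v) struct, received as a Row, field by field.
    // Note: this throws a NullPointerException if either field is null.
    def changeToStr(row: Row): s2 =
      s2(row.get(0).toString, row.get(1).toString)

    val changeToStrUDF = udf(changeToStr _)
    val df2 = df1.select(
      df1.col("id").cast(StringType),
      df1.col("s1"),
      df1.col("d").cast(StringType),
      changeToStrUDF(df1.col("s2")).alias("s2"))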

Answer 2 (score: 0)

After a few days of investigation, I found the best solution to my problem:

    // Target schema: same shape as schema1, with every leaf type replaced by StringType
    val newSchema = StructType(
      Seq(
        StructField("id", StringType, true),
        StructField("s1", StructType(
          Seq(
            StructField("x", StringType, true),
            StructField("y", StringType, true)
          )
        ), true),
        StructField("d", StringType, true),
        StructField("s2", StructType(
          Seq(
            StructField("u", StringType, true),
            StructField("v", StringType, true)
          )
        ), true)
      )
    )

    // One CAST expression per top-level field, using the SQL form of the target type
    val expressions = newSchema.map(
      field => s"CAST ( ${field.name} As ${field.dataType.sql}) ${field.name}"
    )
    val result = df1.selectExpr(expressions : _*)
    result.show()
    // +---+------+---+------+
    // | id|    s1|  d|    s2|
    // +---+------+---+------+
    // |  1|[a, b]|8.0|[1, 2]|
    // |  2|[c, d]|9.0|[3, 4]|
    // +---+------+---+------+

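I hope this helps someone. I spent a lot of time looking for this generic solution; I needed it because I am working with large DataFrames that have many columns to cast.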
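
The target schema in this answer is written out by hand, which gets tedious for wide DataFrames. As a rough sketch that is not part of the original answers, the string-typed schema could instead be derived from the existing one by recursing over the data types. The names `CastAllToString`, `toStringType` and `castAllToString` are hypothetical, and the sketch assumes that only structs and arrays need to be descended into (every other type, including maps, is cast to a plain string) and that column names contain no backticks.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object CastAllToString {
  // Hypothetical helper: keep the shape of structs and arrays,
  // and replace every other type with StringType.
  def toStringType(dt: DataType): DataType = dt match {
    case s: StructType =>
      StructType(s.fields.map(f => f.copy(dataType = toStringType(f.dataType))))
    case ArrayType(elementType, containsNull) =>
      ArrayType(toStringType(elementType), containsNull)
    case _ => StringType
  }

  // Hypothetical helper: cast each top-level column to its string-typed
  // counterpart, keeping the original column names and order.
  def castAllToString(df: DataFrame): DataFrame =
    df.select(df.schema.fields.map { f =>
      col(s"`${f.name}`").cast(toStringType(f.dataType)).as(f.name)
    }: _*)

  def main(args: Array[String]): Unit = {
    // Local session for illustration only
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-all-to-string")
      .getOrCreate()

    // A reduced version of the schema from the question
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("s2", StructType(Seq(
        StructField("u", IntegerType, nullable = true),
        StructField("v", IntegerType, nullable = true)
      )), nullable = true)
    ))
    val rows = Seq(Row(1, Row(1, 2)), Row(2, Row(3, 4)))
    val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

    castAllToString(df).printSchema()
    spark.stop()
  }
}
```

Because the cast goes through the `Column` API rather than SQL strings, no expression text has to be assembled; the trade-off is that the recursion decides the target type for you, so a column that should keep its original type would need to be excluded explicitly.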