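For some reason, I am trying to cast all fields of a DataFrame (one that contains nested StructTypes) to String.
I have already seen some solutions on StackOverflow, but they only work for simple DataFrames without nested structs (for example here: how to cast all columns of dataframe to string).
Here is an example that shows what I actually need: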
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._
val rows1 = Seq(
Row(1, Row("a", "b"), 8.00, Row(1,2)),
Row(2, Row("c", "d"), 9.00, Row(3,4))
)
val rows1Rdd = spark.sparkContext.parallelize(rows1, 4)
val schema1 = StructType(
Seq(
StructField("id", IntegerType, true),
StructField("s1", StructType(
Seq(
StructField("x", StringType, true),
StructField("y", StringType, true)
)
), true),
StructField("d", DoubleType, true),
StructField("s2", StructType(
Seq(
StructField("u", IntegerType, true),
StructField("v", IntegerType, true)
)
), true)
)
)
val df1 = spark.createDataFrame(rows1Rdd, schema1)
println("Schema with nested struct")
df1.printSchema()
Printing the schema of the resulting DataFrame gives:
root
|-- id: integer (nullable = true)
|-- s1: struct (nullable = true)
| |-- x: string (nullable = true)
| |-- y: string (nullable = true)
|-- d: double (nullable = true)
|-- s2: struct (nullable = true)
| |-- u: integer (nullable = true)
| |-- v: integer (nullable = true)
I tried to cast all values to string as follows:
df1.select(df1.columns.map(c => col(c).cast(StringType)) : _*)
But this turns each nested StructType as a whole into a single string, instead of casting each value inside it to String:
root
|-- id: string (nullable = true)
|-- s1: string (nullable = true)
|-- d: string (nullable = true)
|-- s2: string (nullable = true)
Is there a simple solution that would cast all values to StringType? This is the StructType I want as the DataFrame schema after the cast:
root
|-- id: string (nullable = true)
|-- s1: struct (nullable = true)
| |-- x: string (nullable = true)
| |-- y: string (nullable = true)
|-- d: string (nullable = true)
|-- s2: struct (nullable = true)
| |-- u: string (nullable = true)
| |-- v: string (nullable = true)
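Thanks a lot!
Answer 0 (score: 2)
You can build the SQL expressions separately for the simple-type columns and for the struct-type columns.
This solution is not very generic, but it works as long as structs are the only complex type among your columns and their fields are themselves simple types. The code handles any number of fields under a struct, not just two.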
val structCastExpression = df1.schema
.filter(_.dataType.isInstanceOf[StructType])
.map(c=> (c.name, c.dataType.asInstanceOf[StructType].map(_.name)))
.map{ case (col, sub) => s"""cast(${col} as struct${sub.map{ c => s"$c:string" }.mkString("<" , "," , ">")} ) as $col"""}
//List(cast(s1 as struct<x:string,y:string> ) as s1,
// cast(s2 as struct<u:string,v:string> ) as s2)
val otherColumns = df1.schema
.filterNot(_.dataType.isInstanceOf[StructType])
.map( c=> s""" cast(${c.name} as string) as ${c.name} """)
//List(" cast(id as string) as id ", " cast(d as string) as d ")
//original columns
val originalColumns = df1.columns
// Union both the expressions into one big expression
val finalExpression = otherColumns.union(structCastExpression)
// List(" cast(id as string) as id ",
// " cast(d as string) as d ",
// cast(s1 as struct<x:string,y:string> ) as s1,
// cast(s2 as struct<u:string,v:string> ) as s2 )
// Use `selectExpr` to pass the expression
df1.selectExpr(finalExpression : _*)
.select(originalColumns.head, originalColumns.tail: _*)
.printSchema
//root
// |-- id: string (nullable = true)
// |-- s1: struct (nullable = true)
// | |-- x: string (nullable = true)
// | |-- y: string (nullable = true)
// |-- d: string (nullable = true)
// |-- s2: struct (nullable = true)
// | |-- u: string (nullable = true)
// | |-- v: string (nullable = true)
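The same idea can be stretched to structs nested inside structs by building the target type string recursively. The following is only a minimal sketch, not part of the original answer: `stringTypeOf`, `castExpressions` and `exampleSchema` are hypothetical names, and it assumes that column and field names need no quoting and that structs are the only complex type present.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: SQL type string with the same nesting as `dt`,
// but with every non-struct type replaced by `string`.
def stringTypeOf(dt: DataType): String = dt match {
  case s: StructType =>
    s.fields
      .map(f => s"${f.name}:${stringTypeOf(f.dataType)}")
      .mkString("struct<", ",", ">")
  case _ => "string"
}

// One cast expression per top-level column, in the original column order.
def castExpressions(schema: StructType): Seq[String] =
  schema.fields.toSeq.map { f =>
    s"cast(${f.name} as ${stringTypeOf(f.dataType)}) as ${f.name}"
  }

// Illustrative schema with a struct nested two levels deep.
val exampleSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("s1", StructType(Seq(
    StructField("x", StringType),
    StructField("inner", StructType(Seq(StructField("u", IntegerType))))
  )))
))

castExpressions(exampleSchema).foreach(println)
```

Because the expressions are generated in schema order, something like `df1.selectExpr(castExpressions(df1.schema): _*)` should not need the extra `select` to restore the column order.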
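Answer 1 (score: 0)
You can use a udf together with a custom case class, as follows:
// Target shape for s2: both fields as strings
case class s2(u: String, v: String)
// Convert each field of the struct to a String; a null struct or null field stays null
def changeToStr(row: Row): s2 =
  if (row == null) null
  else s2(
    Option(row.get(0)).map(_.toString).orNull,
    Option(row.get(1)).map(_.toString).orNull
  )
val changeToStrUDF = udf(changeToStr _)
// s1 already contains only strings, so it is selected unchanged
val df2 = df1.select(df1.col("id").cast(StringType), df1.col("s1"), df1.col("d").cast(StringType), changeToStrUDF(df1.col("s2")).alias("s2"))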
Answer 2 (score: 0)
After a few days of investigation, I found the solution that worked best for my problem:
val newSchema = StructType(
Seq(
StructField("id", StringType, true),
StructField("s1", StructType(
Seq(
StructField("x", StringType, true),
StructField("y", StringType, true)
)
), true),
StructField("d", StringType, true),
StructField("s2", StructType(
Seq(
StructField("u", StringType, true),
StructField("v", StringType, true)
)
), true)
)
)
val expressions = newSchema.map(
field => s"CAST ( ${field.name} As ${field.dataType.sql}) ${field.name}"
)
val result = df1.selectExpr(expressions : _*)
result.show()
+---+------+---+------+
| id| s1| d| s2|
+---+------+---+------+
| 1|[a, b]|8.0|[1, 2]|
| 2|[c, d]|9.0|[3, 4]|
+---+------+---+------+
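I hope this helps someone. I spent a lot of time looking for a generic solution like this, which I needed because I work with large DataFrames that have many columns to cast.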
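For wide DataFrames, writing `newSchema` by hand may be impractical, and it could probably be derived from the existing schema instead. The following is a minimal sketch under that assumption: `toStringSchema` is a hypothetical helper, not part of the answer above, and it only descends into structs (arrays and maps would simply become strings).

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: keep names, nesting and nullability,
// but replace every non-struct field type with StringType.
def toStringSchema(schema: StructType): StructType =
  StructType(schema.fields.map { f =>
    f.dataType match {
      case nested: StructType => f.copy(dataType = toStringSchema(nested))
      case _                  => f.copy(dataType = StringType)
    }
  })

// Illustrative input, shaped like part of the schema in the question.
val original = StructType(Seq(
  StructField("id", IntegerType, true),
  StructField("s2", StructType(Seq(
    StructField("u", IntegerType, true),
    StructField("v", IntegerType, true)
  )), true)
))

val derived = toStringSchema(original)
derived.printTreeString()
```

Passing `df1.schema` instead of `original` should give a schema that can stand in for the hand-written `newSchema` when building the `CAST` expressions.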