Question

I have a DataFrame df containing data that is the result of a computation. I then store this DataFrame in a database for later use.

For example:

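import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val rowsRDD: RDD[Row] = sc.parallelize(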
  Seq(
    Row("first", 2.0, 7.0),
    Row("second", 3.5, 2.5),
    Row("third", 7.0, 5.9)
  )
)

val schema = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))
  .add(StructField("val2", DoubleType, true))

val df = spark.createDataFrame(rowsRDD, schema)

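I need to check that every column in the final DataFrame has the expected data type. One way, of course, is to create the DataFrame with a schema (as in the example above). However, the data types can sometimes change during the computation, after the initial DataFrame has been created (for example, when a formula applied to the DataFrame is changed).

So I want to double-check that the final DataFrame still matches the initial schema and, if it does not, apply the corresponding casts. Is there a way to do this?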

Answer 1

You can get a DataFrame's schema with the schema method:

df.schema
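
As a minimal sketch of the check itself, the schema can be compared against the expected one by column name. The `expected` schema below is the one from the question; the `actual` schema is a hypothetical stand-in for `df.schema` after a computation has turned `val2` into a float, and `actualTypes` and `mismatched` are names made up for this illustration.

```scala
import org.apache.spark.sql.types._

// Expected schema, as defined in the question
val expected = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))
  .add(StructField("val2", DoubleType, true))

// Hypothetical stand-in for df.schema after val2 drifted to FloatType
val actual = new StructType()
  .add(StructField("id", StringType, true))
  .add(StructField("val1", DoubleType, true))
  .add(StructField("val2", FloatType, true))

// Look up actual data types by column name
val actualTypes: Map[String, DataType] =
  actual.fields.map(f => f.name -> f.dataType).toMap

// Names of expected columns that are missing or have a different type
val mismatched: Seq[String] = expected.fields.toSeq.collect {
  case f if !actualTypes.get(f.name).contains(f.dataType) => f.name
}
```

If `mismatched` is empty, no cast is needed; otherwise it lists the columns to fix. Comparing by name rather than by position means the column order does not matter for this check.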

Define a castColumn method:

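import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.DataType
// Cast a single column to the given type, keeping its name
def castColumn(df: DataFrame, colName: String, targetType: DataType): DataFrame = {
    df.withColumn(colName, df.col(colName).cast(targetType))
}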

Then apply this method to every column you need to cast.

First, get an array of tuples holding the colName and the target dataType:

// Assumes both DataFrames have the same column names; the sortBy is needed in case the columns are not in the same order

// You can also iterate through dfOrigin.schema only and compare their dataTypes with target dataTypes instead of zipping

val differences = (dfOrigin.schema.fields.sortBy{case (x: StructField) => x.name} zip dfTarget.schema.fields.sortBy{case (x: StructField) => x.name}).collect{
                   case (origin: StructField, target: StructField) if origin.dataType != target.dataType => 
                        (origin.name, target.dataType)
}

Then fold the casts over the DataFrame:

 differences.foldLeft(df){
      case (acc, value) => castColumn(acc, value._1, value._2)
 }

Answer 2

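If I understand your requirement correctly, the following example shows how to revert a DataFrame whose column types have changed back to its original version: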

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

val df1 = Seq(
  (1, "a", 100L, 10.0), (2, "b", 200L, 20.0)
).toDF("c1", "c2", "c3", "c4")

val df2 = Seq(
  (1, "a", 100, 10.0f), (2, "b", 200, 20.0f)
).toDF("c1", "c2", "c3", "c4")

df2.printSchema
// root
//  |-- c1: integer (nullable = false)
//  |-- c2: string (nullable = true)
//  |-- c3: integer (nullable = false)
//  |-- c4: float (nullable = false)

val fieldsDiffType = (df1.schema.fields zip df2.schema.fields).collect{
  case (a: StructField, b: StructField) if a.dataType != b.dataType =>
    (a.name, a.dataType)
}
// fieldsDiffType: Array[(String, org.apache.spark.sql.types.DataType)] =
//   Array((c3,LongType), (c4,DoubleType))

val df2To1 = fieldsDiffType.foldLeft(df2)( (accDF, field) =>
  accDF.withColumn(field._1, col(field._1).cast(field._2))
)

df2To1.printSchema
// root
//  |-- c1: integer (nullable = false)
//  |-- c2: string (nullable = true)
//  |-- c3: long (nullable = false)
//  |-- c4: double (nullable = false)

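Note that this solution only works if the number and order of the DataFrame's columns stay the same, and it does not cover types such as Array or Struct.

[UPDATE]

If you are concerned that the column order might change, you can sort df1.schema.fields and df2.schema.fields first and then perform the zip: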

df1.schema.fields.sortBy(_.name) zip df2.schema.fields.sortBy(_.name)
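
An alternative to sorting and zipping, sketched here rather than taken from the answer, is to look columns up by name and cast each one to the type the original schema expects. The helper name `conformTo` is hypothetical, and the sketch assumes that every column in the expected schema exists in the DataFrame, that column names contain no dots, and that only top-level primitive types are involved. The local SparkSession is only there to make the snippet self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

// Hypothetical helper: select the expected columns in the expected order,
// casting each to its expected type. Columns not in `expected` are dropped.
def conformTo(df: DataFrame, expected: StructType): DataFrame =
  df.select(expected.fields.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("conform-sketch")
  .getOrCreate()
import spark.implicits._

// Original DataFrame with the desired column types
val df1 = Seq((1, "a", 100L, 10.0)).toDF("c1", "c2", "c3", "c4")

// Same columns, but in a different order and with changed types for c3 and c4
val df2 = Seq(("a", 100, 10.0f, 1)).toDF("c2", "c3", "c4", "c1")

val fixed = conformTo(df2, df1.schema)
```

Casting a column to the type it already has leaves it unchanged, so the helper does not need to compute the differences first. It only aligns column order and data types; nullability flags are not touched by the cast.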

Answer 3

Based on the Untyped Dataset Operations section of https://spark.apache.org/docs/2.2.0/sql-programming-guide.html, it should be:

df.printSchema()

Answer 4

You can try

df.printSchema()

This prints the schema in a tree format. Hope this helps.

How to check the schema of a DataFrame?
