Schema comparison of two dataframes in scala

Date: 2017-12-18 06:10:49

Tags: scala apache-spark-sql schema

I am trying to write some test cases to validate the data between the source (.csv) file and the target (Hive table). One of the validations is structure validation of the table.

I have loaded the .csv data (using a defined schema) into one dataframe and extracted the Hive table data into another dataframe.
When I now try to compare the schemas of the two dataframes, the comparison returns false, and I am not sure why. Any ideas?

source dataframe schema:

scala> res39.printSchema
root
 |-- datetime: timestamp (nullable = true)
 |-- load_datetime: timestamp (nullable = true)
 |-- source_bank: string (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- header_row_count: integer (nullable = true)
 |-- emp_hours: double (nullable = true)

target dataframe schema:

scala> targetRawData.printSchema
root
 |-- datetime: timestamp (nullable = true)
 |-- load_datetime: timestamp (nullable = true)
 |-- source_bank: string (nullable = true)
 |-- emp_name: string (nullable = true)
 |-- header_row_count: integer (nullable = true)
 |-- emp_hours: double (nullable = true)

When I compare, it returns false:

scala> res39.schema == targetRawData.schema
res47: Boolean = false

Data in the two dataframes is shown below:

scala> res39.show
+-------------------+-------------------+-----------+--------+----------------+---------+
|           datetime|      load_datetime|source_bank|emp_name|header_row_count|emp_hours|
+-------------------+-------------------+-----------+--------+----------------+---------+
|2017-01-01 01:02:03|2017-01-01 01:02:03|        RBS| Naveen |             100|    15.23|
|2017-03-15 01:02:03|2017-03-15 01:02:03|        RBS| Naveen |             100|   115.78|
|2015-04-02 23:24:25|2015-04-02 23:24:25|        RBS|   Arun |             200|     2.09|
|2010-05-28 12:13:14|2010-05-28 12:13:14|        RBS|   Arun |             100|    30.98|
|2018-06-04 10:11:12|2018-06-04 10:11:12|        XZX|   Arun |             400|     12.0|
+-------------------+-------------------+-----------+--------+----------------+---------+


scala> targetRawData.show
+-------------------+-------------------+-----------+--------+----------------+---------+
|           datetime|      load_datetime|source_bank|emp_name|header_row_count|emp_hours|
+-------------------+-------------------+-----------+--------+----------------+---------+
|2017-01-01 01:02:03|2017-01-01 01:02:03|        RBS|  Naveen|             100|    15.23|
|2017-03-15 01:02:03|2017-03-15 01:02:03|        RBS|  Naveen|             100|   115.78|
|2015-04-02 23:25:25|2015-04-02 23:25:25|        RBS|    Arun|             200|     2.09|
|2010-05-28 12:13:14|2010-05-28 12:13:14|        RBS|    Arun|             100|    30.98|
+-------------------+-------------------+-----------+--------+----------------+---------+

The complete code looks like below:

//import org.apache.spark
import org.apache.spark.sql.hive._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.functions.{to_date, to_timestamp}
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.SparkSession
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.text._
import java.util.Date
import scala.util._
import org.apache.spark.sql.hive.HiveContext

  //val conf = new SparkConf().setAppName("Simple Application")
  //val sc = new SparkContext(conf)
  val hc = new HiveContext(sc)
  val spark: SparkSession = SparkSession.builder().appName("Simple Application").config("spark.master", "local").getOrCreate()

   // set source and target location
    val sourceDataLocation = "hdfs://localhost:9000/source.txt"
    val targetTableName = "TableA"

    // Extract source data
    println("Extracting SAS source data from csv file location " + sourceDataLocation);
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    val sourceRawCsvData = sc.textFile(sourceDataLocation)

    println("Extracting target data from hive table " + targetTableName)
    val targetRawData = hc.sql("Select datetime,load_datetime,trim(source_bank) as source_bank,trim(emp_name) as emp_name,header_row_count, emp_hours from " + targetTableName)


    // Add the test cases here
    // Test 2 - Validate the Structure
       val headerColumns = sourceRawCsvData.first().split(",").to[List]
       val schema = TableASchema(headerColumns)

       val data = sourceRawCsvData
         .mapPartitionsWithIndex((index, element) => if (index == 0) element.drop(1) else element)
         .map(_.split(",").toList)
         .map(row)

       val dataFrame = spark.createDataFrame(data,schema)
       val sourceDataFrame = dataFrame.toDF(dataFrame.columns map(_.toLowerCase): _*)
       data.collect
       data.getClass
    // Test 3 - Validate the data
    // Test 4 - Calculate the average and variance of Int or Dec columns
    // Test 5 - Test 5

  def UpdateResult(tableName: String, returnCode: Int, description: String){
    val insertString = "INSERT INTO TestResult VALUES('" + tableName + "', " + returnCode + ",'" + description + "')"
    val a = hc.sql(insertString)

    }


  def TableASchema(columnName: List[String]): StructType = {
    StructType(
      Seq(
        StructField(name = "datetime", dataType = TimestampType, nullable = true),
        StructField(name = "load_datetime", dataType = TimestampType, nullable = true),
        StructField(name = "source_bank", dataType = StringType, nullable = true),
        StructField(name = "emp_name", dataType = StringType, nullable = true),
        StructField(name = "header_row_count", dataType = IntegerType, nullable = true),
        StructField(name = "emp_hours", dataType = DoubleType, nullable = true)
        )
    )
  }

  def row(line: List[String]): Row = {
       Row(convertToTimestamp(line(0).trim), convertToTimestamp(line(1).trim), line(2).trim, line(3).trim, line(4).toInt, line(5).toDouble)
    }


  def convertToTimestamp(s: String) : Timestamp = s match {
     case "" => null
     case _ => {
        val format = new SimpleDateFormat("ddMMMyyyy:HH:mm:ss")
        Try(new Timestamp(format.parse(s).getTime)) match {
        case Success(t) => t
        case Failure(_) => null
      }
    }
  }

  }

6 Answers:

Answer 0 (score: 6)

Building on @Derek Kaknes' answer, here is the solution I came up with for comparing schemas. It considers only column name, data type & nullability, and ignores metadata.

// Extract relevant information: name (key), type & nullability (values) of columns
def getCleanedSchema(df: DataFrame): Map[String, (DataType, Boolean)] = {
    df.schema.map { (structField: StructField) =>
      structField.name.toLowerCase -> (structField.dataType, structField.nullable)
    }.toMap
  }

// Compare relevant information
def getSchemaDifference(schema1: Map[String, (DataType, Boolean)],
                        schema2: Map[String, (DataType, Boolean)]
                       ): Map[String, (Option[(DataType, Boolean)], Option[(DataType, Boolean)])] = {
  (schema1.keys ++ schema2.keys).
    map(_.toLowerCase).
    toList.distinct.
    flatMap { (columnName: String) =>
      val schema1FieldOpt: Option[(DataType, Boolean)] = schema1.get(columnName)
      val schema2FieldOpt: Option[(DataType, Boolean)] = schema2.get(columnName)

      if (schema1FieldOpt == schema2FieldOpt) None
      else Some(columnName -> (schema1FieldOpt, schema2FieldOpt))
    }.toMap
}
  • The getCleanedSchema method extracts the information of interest - each column's data type & nullability - and returns a map from column name to that tuple

  • The getSchemaDifference method returns a map containing only the columns that differ between the two schemas. If a column is missing from one of the two schemas, its corresponding properties will be None
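
A minimal usage sketch of the two helpers above (df1 and df2 are hypothetical stand-ins for the two dataframes being compared):

```scala
// Compare the cleaned schemas of two dataframes; an empty difference
// map means the schemas agree on column names, types and nullability.
val diff = getSchemaDifference(getCleanedSchema(df1), getCleanedSchema(df2))

if (diff.isEmpty) println("Schemas match (ignoring metadata)")
else diff.foreach { case (col, (left, right)) =>
  println(s"Column '$col' differs: schema1=$left, schema2=$right")
}
```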

Answer 1 (score: 4)

I ran into exactly the same problem. When reading data from Hive, the schema's StructField components sometimes contain Hive metadata in the metadata field. You cannot see it when printing the schema, because this field is not part of the toString definition.

This is the solution I decided to use: before comparing, I simply take a copy of the schema with empty metadata:

schema.map(_.copy(metadata = Metadata.empty))
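
Applied to both sides of the comparison from the question, this could look like the following sketch. Note that schema.map returns a Seq[StructField], so it is wrapped back into a StructType before comparing; Metadata and StructType come from org.apache.spark.sql.types:

```scala
import org.apache.spark.sql.types.{Metadata, StructType}

// Return a copy of the schema in which every field's metadata is emptied,
// so that equality depends only on name, data type and nullability.
def stripMetadata(schema: StructType): StructType =
  StructType(schema.map(_.copy(metadata = Metadata.empty)))

stripMetadata(res39.schema) == stripMetadata(targetRawData.schema)
```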

Answer 2 (score: 3)

I have run into this problem before, and it was caused by a difference in the StructField.metadata attribute. It is almost impossible to spot out of the box, because a simple inspection of a StructField only shows the name, data type and nullability. My suggestion for debugging it is to compare the fields' metadata. Something like this:

res39.schema.zip(targetRawData.schema).foreach{ case (r: StructField, t: StructField) => 
  println(s"Field: ${r.name}\n--| res_meta: ${r.metadata}\n--|target_meta: ${t.metadata}")}

If you want to compare the schemas but ignore the metadata, I don't have a great solution. The best I could come up with is to iterate over the StructFields, manually strip the metadata, and then create a temporary copy of the dataframe without metadata. So you can do something like this (assuming df is the dataframe whose metadata should be removed):

val schemaWithoutMetadata = StructType(df.schema.map{ case f: StructField => 
  StructField(f.name, f.dataType, f.nullable)
})
val tmpDF = spark.sqlContext.createDataFrame(df.rdd, schemaWithoutMetadata)

Then you can either compare the dataframes directly, or compare the schemas the way you tried. I don't think this solution is performant, so it should only be used on small datasets.

Answer 3 (score: 1)

Here is another solution, based on the observation that the string representation of name + DataType + nullable is unique for each column:

import org.apache.spark.sql.types.{StructType, StructField}

val schemaDiff: (StructType, StructType)  => List[StructField] = (schema1, schema2) => {
      val toMap: StructType => Map[String, StructField] = schema => {
        schema.map(sf => {
          val name = s"${sf.name}-${sf.dataType.typeName}-${sf.nullable.toString}"
          (name -> sf)
        }).toMap
      }

      val schema1Set = toMap(schema1).toSet
      val schema2Set = toMap(schema2).toSet
      val commonItems =  schema1Set.intersect(schema2Set)

      (schema1Set ++ schema2Set -- commonItems).toMap.values.toList
}

Note that field names are case-sensitive, so differently-cased column names count as different columns.

Steps:

  1. Generate a Map[String, StructField] for each schema, where each key has the form name-datatype-nullable
  2. Take the intersection of the two schemas
  3. Subtract the intersection from the union of the schemas
  4. Return the difference as a List of StructField

Usage: schemaDiff(df1.schema, df2.schema)

Answer 4 (score: 0)

This is a Java-level object comparison problem; you should try using .equals(). This generally works, unless a different SourceType introduces metadata or nullability issues.

Answer 5 (score: 0)

val csDf = res39       // any source dataframe
val myDf = targetRawData   // target data frame

val csFields = csDf.schema.fields
val myFields = myDf.schema.fields

val csFieldNameTypeMap = csFields.map(f => f.name -> f.dataType).toMap
val myFieldNameTypemap = myFields.map(f => f.name->f.dataType).toMap

val diffFields = csFields.filter(f =>  csFieldNameTypeMap.get(f.name) != myFieldNameTypemap.get(f.name) ).toList
val diffFieldsMyDf = myFields.filter(f =>  csFieldNameTypeMap.get(f.name) != myFieldNameTypemap.get(f.name) ).toList

'diffFields' and 'diffFieldsMyDf' will give you the fields whose data types differ. Similar steps can be used to check 'nullable'; just replace 'dataType' with 'nullable'.
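
The nullable variant mentioned above might be sketched like this (reusing csFields and myFields from the answer; the map names are hypothetical):

```scala
// Map each field name to its nullability instead of its data type.
val csFieldNullableMap = csFields.map(f => f.name -> f.nullable).toMap
val myFieldNullableMap = myFields.map(f => f.name -> f.nullable).toMap

// Fields whose nullability differs between the two schemas.
val diffNullable = csFields.filter(f =>
  csFieldNullableMap.get(f.name) != myFieldNullableMap.get(f.name)).toList
```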