I need to compare a Hive table schema with a DataFrame containing a CSV file's schema

Time: 2018-01-02 12:51:43

Tags: scala csv hadoop apache-spark dataframe

I am trying to validate the structure of a Hive table against a CSV file stored in S3.

This is the CSV file's schema, loaded into a DataFrame:

+----+----------------+----+---+-----------+-----------+
|S_No|        Variable|Type|Len|     Format|   Informat|
+----+----------------+----+---+-----------+-----------+
|   1|        DATETIME| Num|  8|DATETIME20.|DATETIME20.|
|   2|   LOAD_DATETIME| Num|  8|DATETIME20.|DATETIME20.|
|   3|     SOURCE_BANK|Char|  1|       null|       null|
|   4|        EMP_NAME|Char| 50|       null|       null|
|   5|HEADER_ROW_COUNT| Num|  8|       null|       null|
|   6|      EMP _HOURS| Num|  8|       15.2|       15.1|
+----+----------------+----+---+-----------+-----------+

I need to compare it with the output of the following:
import org.apache.spark.sql.hive.HiveContext
val targetTableName = "TableA"
val hc = new HiveContext(sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val targetRawData = hc.sql("Select datetime,load_datetime,trim(source_bank) as source_bank,trim(emp_name) as emp_name,header_row_count, emp_hours from " + targetTableName)

val schema = targetRawData.schema

The output is: schema: org.apache.spark.sql.types.StructType = StructType(StructField(datetime,TimestampType,true), StructField(load_datetime,TimestampType,true), StructField(source_bank,StringType,true), StructField(emp_name,StringType,true), StructField(header_row_count,IntegerType,true), StructField(emp_hours,DoubleType,true))
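
A minimal sketch of the comparison I have in mind, assuming the metadata DataFrame above is named csvMetaDf (an assumed name, not from my code) and that the SAS-style Char type maps to StringType while Num maps to a numeric or timestamp Spark type:

import org.apache.spark.sql.types._

// Collect the Variable -> Type pairs from the CSV metadata DataFrame (assumed name: csvMetaDf).
val expectedTypes: Map[String, String] = csvMetaDf
  .select("Variable", "Type")
  .collect()
  .map(row => (row.getString(0).trim.toLowerCase, row.getString(1).trim))
  .toMap

// Check every Hive field against the CSV metadata. DATETIME columns are declared
// as Num in the CSV but arrive as TimestampType in Hive, so Num accepts both.
val mismatches = schema.fields.flatMap { field =>
  expectedTypes.get(field.name.toLowerCase) match {
    case None =>
      Some(s"${field.name}: in the Hive table but not in the CSV metadata")
    case Some("Char") if field.dataType != StringType =>
      Some(s"${field.name}: CSV says Char, Hive says ${field.dataType}")
    case Some("Num") if !field.dataType.isInstanceOf[NumericType] && field.dataType != TimestampType =>
      Some(s"${field.name}: CSV says Num, Hive says ${field.dataType}")
    case _ => None
  }
}

mismatches.foreach(println) // no output means the schemas line up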

2 Answers:

Answer 0 (score: 0):

You can also use MegaSparkDiff, an open-source tool for comparing many types of data sources, including S3, HIVE, CSV, JDBC, etc.

https://github.com/FINRAOS/MegaSparkDiff

The call below returns a pair of DataFrames: inLeftButNotInRight and inRightButNotInLeft.

Your success condition is that both DataFrames have zero records, which means the data is identical. Since you are only interested in comparing schemas, you may not want to load all the data.

    SparkFactory.initializeSparkContext();

    AppleTable leftAppleTable = SparkFactory.parallelizeTextSource("S3://file1", "table1");

    AppleTable rightAppleTable = SparkFactory.parallelizeHiveSource("select * from hivetable", "hivetable");

    Pair<Dataset<Row>, Dataset<Row>> resultPair = SparkCompare.compareAppleTables(leftAppleTable, rightAppleTable);

    // Success means both diff DataFrames are empty, i.e. neither side has rows the other lacks.
    if (resultPair.getLeft().count() == 0 && resultPair.getRight().count() == 0)
    {
        //success condition
    }

    SparkFactory.stopSparkContext();
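
If you only need to check that the two schemas match, rather than diffing the data, a lighter-weight sketch (independent of MegaSparkDiff; csvDf, a CSV read from S3 with an inferred schema, and targetRawData, the Hive DataFrame from the question, are assumed names) is to compare the StructTypes directly:

    // Sketch: compare only the schemas, without diffing any rows.
    val csvFields  = csvDf.schema.fields.map(f => (f.name.toLowerCase, f.dataType)).toSet
    val hiveFields = targetRawData.schema.fields.map(f => (f.name.toLowerCase, f.dataType)).toSet

    val schemasMatch = csvFields == hiveFields

    // When they differ, the set differences show exactly which fields disagree.
    val onlyInCsv  = csvFields.diff(hiveFields)
    val onlyInHive = hiveFields.diff(csvFields)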

Answer 1 (score: 0):

Naveen, you can follow the steps below.

Define a class:


    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.SparkSession

    class ColumnMatch(spark: SparkSession, inputdf: DataFrame, requiredColNames: Array[String]) {

      // Required columns that are absent from the input DataFrame.
      val missingColumns: Array[String] = requiredColNames.diff(inputdf.columns)

      def missingColumnsMessage(): String = {
        val missingColNames = missingColumns.mkString(", ")
        val allColNames = inputdf.columns.mkString(", ")
        s"The [$missingColNames] columns are not included in the DataFrame with the following columns [$allColNames]"
      }

      def validatePresenceOfColumns(): String = {
        if (missingColumns.nonEmpty) missingColumnsMessage()
        else "No mismatch in the column names found"
      }
    }
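
A usage sketch (assuming a SparkSession named spark is in scope; the required column names are taken from the question's Hive query):

    // Check that the Hive DataFrame exposes all columns named in the CSV metadata.
    val requiredCols = Array("datetime", "load_datetime", "source_bank",
                             "emp_name", "header_row_count", "emp_hours")

    val matcher = new ColumnMatch(spark, targetRawData, requiredCols)
    println(matcher.validatePresenceOfColumns())
    // Either lists the missing columns or prints "No mismatch in the column names found".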