How to compare the schema of a dataframe read from an RDBMS table with the same table on Hive?

Time: 2018-09-03 18:45:18

Tags: scala apache-spark

I create a dataframe by reading an RDBMS table from Postgres as follows:

```scala
val yearDF = spark.read.format("jdbc").option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 10)
  .load()
```

Contents of execQuery:

```sql
select qtd_balance_text,ytd_balance_text,del_flag,source_system_name,period_year from dbname.hrtable;
```

This is the schema of my final dataframe:

```scala
println(yearDF.schema)
StructType(StructField(qtd_balance_text,StringType,true), StructField(ytd_balance_text,StringType,true), StructField(del_flag,IntegerType,true), StructField(source_system_name,StringType,true), StructField(period_year,DecimalType(15,0),true))
```

There is a table with the same name, hrtable, and the same column names on Hive. Before ingesting the data into the Hive table, I want to add a check in the code to see whether the GP and Hive table schemas are the same. I can access the Hive schema as follows:

```scala
spark.sql("desc formatted databasename.hrtable").collect.foreach(println)
```

But the problem is that it collects the schema in a different form:

```
[qtd_balance_text,bigint,null]
[ytd_balance_text,string,null]
[del_flag,string,null]
[source_system_name,bigint,null]
[period_year,bigint,null]
[Type,MANAGED,]
[Provider,hive,]
[Table Properties,[orc.stripe.size=536870912, transient_lastDdlTime=1523914516, last_modified_time=1523914516, last_modified_by=username, orc.compress.size=268435456, orc.compress=ZLIB, serialization.null.format=null],]
[Location,hdfs://devenv/apps/hive/warehouse/databasename.db/hrtable,]
[Serde Library,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Storage Properties,[serialization.format=1],]
[Partition Provider,Catalog,]
```

Clearly I cannot compare the schemas this way, and I don't see how to do it. Could anyone let me know how to properly compare the schema of the dataframe yearDF with that of the Hive table?
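If one did want to work from this `desc formatted` output directly, the column rows can be separated from the table-metadata rows. A hypothetical sketch (`columnPairs` and the sample rows are illustrative, not from the original post), modelling the collected rows as plain strings: in the output above, column rows carry `null` in their third (comment) slot, while metadata rows such as `[Type,MANAGED,]` leave it empty, so we take rows only while that holds.

```scala
// Hypothetical helper: extract (column name, data type) pairs from rows
// shaped like "[qtd_balance_text,bigint,null]". Metadata rows such as
// "[Type,MANAGED,]" have an empty third slot, which ends the column section.
def columnPairs(rows: Seq[String]): Seq[(String, String)] =
  rows
    .map(_.stripPrefix("[").stripSuffix("]"))
    .map(_.split(",", 3))                         // at most 3 parts; keeps trailing empties
    .takeWhile(p => p.length == 3 && p(2) == "null")
    .map(p => (p(0), p(1)))

// Illustrative sample rows mimicking the collected output above:
val descRows = Seq(
  "[qtd_balance_text,bigint,null]",
  "[ytd_balance_text,string,null]",
  "[Type,MANAGED,]",
  "[Provider,hive,]"
)

println(columnPairs(descRows))
// List((qtd_balance_text,bigint), (ytd_balance_text,string))
```

This is fragile by design (it leans on the exact row shape), which is why the answer below avoids parsing altogether.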

1 answer:

Answer 0: (score: 0)

Instead of parsing the Hive table schema output, you can try this option.

Read the Hive table as a dataframe as well. Let's say that dataframe is df1 and your yearDF is df2. Then compare the schemas as shown below.

If there is also a possibility that the number of columns differs between the two dataframes, additionally keep a df1.schema.size == df2.schema.size comparison in the if check.

```scala
val x = df1.schema.sortBy(x => x.name)  // get dataframe 1's schema, sorted by column name
val y = df2.schema.sortBy(x => x.name)  // get dataframe 2's schema, sorted by column name

// Zip the sorted fields pairwise (1st field of df1 with 1st of df2, and so on)
// and keep only the pairs whose name, type, or nullability do not match.
val out = x.zip(y).filter(x => x._1 != x._2)

if (out.size == 0) {  // `out` is empty if the schemas match
  println("matching")
} else {
  println("not matching")
}
```
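The sort/zip/filter logic above depends only on ordered (name, type) pairs, so it can be sketched and exercised without a live SparkSession. In this illustrative version, schemas are modelled as plain (name, type) tuples standing in for StructFields (an assumption for the example; with Spark you would derive the pairs via something like `df.schema.map(f => (f.name, f.dataType.simpleString))`), and the column-count guard mentioned above is folded in:

```scala
// Stand-in for a schema: a sequence of (column name, data type name) pairs.
def schemasMatch(a: Seq[(String, String)], b: Seq[(String, String)]): Boolean = {
  if (a.size != b.size) false                 // the extra column-count guard
  else {
    val x = a.sortBy(_._1)                    // sort both schemas by column name
    val y = b.sortBy(_._1)
    x.zip(y).forall { case (f1, f2) => f1 == f2 }  // every aligned pair must match
  }
}

// Illustrative schemas (names chosen to echo the question):
val gpSchema   = Seq(("del_flag", "int"), ("qtd_balance_text", "string"))
val hiveSchema = Seq(("qtd_balance_text", "string"), ("del_flag", "int"))

println(schemasMatch(gpSchema, hiveSchema))                      // true: order-insensitive
println(schemasMatch(gpSchema, hiveSchema :+ ("extra", "int")))  // false: column counts differ
```

Note that comparing full StructFields (as the answer's code does) also compares nullability and metadata; reducing fields to (name, type) pairs as sketched here is a looser check, which may be preferable when the JDBC source and Hive disagree only on nullable flags.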