I create a dataframe by reading an RDBMS table from postgres, as follows:
val yearDF = spark.read.format("jdbc")
  .option("url", connectionUrl)
  .option("dbtable", s"(${execQuery}) as year2017")
  .option("user", devUserName)
  .option("password", devPassword)
  .option("numPartitions", 10)
  .load()
Contents of execQuery:

select qtd_balance_text,ytd_balance_text,del_flag,source_system_name,period_year from dbname.hrtable;

This is the schema of my final dataframe:

println(yearDF.schema)
StructType(StructField(qtd_balance_text,StringType,true), StructField(ytd_balance_text,StringType,true), StructField(del_flag,IntegerType,true), StructField(source_system_name,StringType,true), StructField(period_year,DecimalType(15,0),true))

There is a Hive table of the same name, hrtable, with the same column names. Before ingesting the data into the Hive table, I want to add a check in the code to see whether the schemas of the GP and Hive tables are the same. I can access the Hive table's schema as follows:

spark.sql("desc formatted databasename.hrtable").collect.foreach(println)

But the problem is that it collects the schema in a different format:
[qtd_balance_text,bigint,null]
[ytd_balance_text,string,null]
[del_flag,string,null]
[source_system_name,bigint,null]
[period_year,bigint,null]
[Type,MANAGED,]
[Provider,hive,]
[Table Properties,[orc.stripe.size=536870912, transient_lastDdlTime=1523914516, last_modified_time=1523914516, last_modified_by=username, orc.compress.size=268435456, orc.compress=ZLIB, serialization.null.format=null],]
[Location,hdfs://devenv/apps/hive/warehouse/databasename.db/hrtable,]
[Serde Library,org.apache.hadoop.hive.ql.io.orc.OrcSerde,]
[InputFormat,org.apache.hadoop.hive.ql.io.orc.OrcInputFormat,]
[OutputFormat,org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat,]
[Storage Properties,[serialization.format=1],]
[Partition Provider,Catalog,]
Obviously I can't compare the schemas in this form, and I don't understand how to do it. Could anyone let me know how to properly compare the schema of the dataframe yearDF with that of the Hive table hrtable?
Answer 0 (score: 0)
Instead of parsing the Hive table's schema output, you can try this option: read the Hive table as a dataframe as well. Assume that dataframe is df1 and your yearDF is df2, then compare the schemas as shown below.
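A minimal sketch of setting up the two dataframes, assuming the Hive table is registered in the metastore as databasename.hrtable (the name used in your desc formatted call above):

// read the Hive table through the metastore as a dataframe
val df1 = spark.table("databasename.hrtable")
// the dataframe already built from the JDBC read
val df2 = yearDF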
If there is also a chance that the two dataframes differ in their number of columns, keep an additional size comparison, e.g. df1.schema.size == df2.schema.size, in an if check (a combined sketch follows the code below).
val x = df1.schema.sortBy(f => f.name) // fields of df1's schema, sorted by column name
val y = df2.schema.sortBy(f => f.name) // fields of df2's schema, sorted by column name
val out = x.zip(y).filter(p => p._1 != p._2) // pair the sorted fields positionally and keep any pair whose name, datatype or nullability differs
if (out.isEmpty) { // no differing pairs means the schemas match
  println("matching")
} else {
  println("not matching")
}
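A combined sketch that folds the size check mentioned above into the comparison, so schemas with different column counts are also reported as non-matching (df1 and df2 as assumed earlier):

val fields1 = df1.schema.sortBy(f => f.name)
val fields2 = df2.schema.sortBy(f => f.name)
// schemas match only if the column counts agree and every sorted field pair is identical
val matching = fields1.size == fields2.size &&
  fields1.zip(fields2).forall { case (a, b) => a == b }
if (matching) println("matching") else println("not matching")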