Here I am trying to validate the Hive table schema against a CSV file stored in S3.
This is the CSV file's schema, as loaded into a DataFrame:
+----+----------------+----+---+-----------+-----------+
|S_No|        Variable|Type|Len|     Format|   Informat|
+----+----------------+----+---+-----------+-----------+
|   1|        DATETIME| Num|  8|DATETIME20.|DATETIME20.|
|   2|   LOAD_DATETIME| Num|  8|DATETIME20.|DATETIME20.|
|   3|     SOURCE_BANK|Char|  1|       null|       null|
|   4|        EMP_NAME|Char| 50|       null|       null|
|   5|HEADER_ROW_COUNT| Num|  8|       null|       null|
|   6|       EMP_HOURS| Num|  8|       15.2|       15.1|
+----+----------------+----+---+-----------+-----------+
I need to compare it with the output of:

import org.apache.spark.sql.hive.HiveContext
val targetTableName = "TableA"
val hc = new HiveContext(sc)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val targetRawData = hc.sql("Select datetime, load_datetime, trim(source_bank) as source_bank, trim(emp_name) as emp_name, header_row_count, emp_hours from " + targetTableName)
val schema = targetRawData.schema

which is:

schema: org.apache.spark.sql.types.StructType = StructType(StructField(datetime,TimestampType,true), StructField(load_datetime,TimestampType,true), StructField(source_bank,StringType,true), StructField(emp_name,StringType,true), StructField(header_row_count,IntegerType,true), StructField(emp_hours,DoubleType,true))
Answer 0 (score: 0)
You can also use MegaSparkDiff, an open-source project, to compare multiple types of data sources, including S3, Hive, CSV, JDBC, etc.
https://github.com/FINRAOS/MegaSparkDiff
The following pair of calls will return inLeftButNotInRight and inRightButNotInLeft as DataFrames. Your success condition is that both DataFrames have zero records, which means the data is identical. Since you are only interested in comparing the schemas, you may not want to load all the data; see the schema-only sketch after the snippet below.
SparkFactory.initializeSparkContext();

AppleTable leftAppleTable = SparkFactory.parallelizeTextSource("S3://file1", "table1");
AppleTable rightAppleTable = SparkFactory.parallelizeHiveSource("select * from hivetable", "hivetable");

Pair<Dataset<Row>, Dataset<Row>> resultPair = SparkCompare.compareAppleTables(leftAppleTable, rightAppleTable);

if (resultPair.getLeft().count() == 0 && resultPair.getRight().count() == 0)
{
    // success condition: neither side has records missing from the other
}

SparkFactory.stopSparkContext();
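If you only care about the schemas, a minimal Scala sketch (separate from MegaSparkDiff) can compare field names and types without reading any rows. It assumes targetRawData from the question is in scope; the expected name/type pairs below are an assumption based on the StructType output shown in the question:

import org.apache.spark.sql.types._

// Expected schema; the name/type pairs are assumptions taken from the
// StructType output shown in the question.
val expectedSchema = StructType(Seq(
  StructField("datetime", TimestampType),
  StructField("load_datetime", TimestampType),
  StructField("source_bank", StringType),
  StructField("emp_name", StringType),
  StructField("header_row_count", IntegerType),
  StructField("emp_hours", DoubleType)
))

// Compare field names and types as sets; no data is loaded.
val actual = targetRawData.schema.fields.map(f => (f.name, f.dataType)).toSet
val expected = expectedSchema.fields.map(f => (f.name, f.dataType)).toSet
val mismatches = (expected diff actual) ++ (actual diff expected)
if (mismatches.isEmpty) println("Schemas match")
else println(s"Schema mismatches: $mismatches")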
Answer 1 (score: 0)
Naveen, you can follow the steps below.

Define a class:
import org.apache.spark.sql.DataFrame

class ColumnMatch(inputdf: DataFrame, requiredColNames: Array[String]) {

  // Required columns that are absent from the DataFrame
  val missingColumns: Array[String] = requiredColNames.diff(inputdf.columns)

  def missingColumnsMessage(): String = {
    val missingColNames = missingColumns.mkString(", ")
    val allColNames = inputdf.columns.mkString(", ")
    s"The [$missingColNames] columns are not included in the DataFrame with the following columns [$allColNames]"
  }

  def validatePresenceOfColumns(): String = {
    if (missingColumns.nonEmpty) missingColumnsMessage()
    else "No mismatch in the column names found"
  }
}
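A hypothetical usage sketch, assuming targetRawData from the question is the DataFrame to validate and taking the required column names from the CSV schema:

// Hypothetical usage: check that the Hive DataFrame has all CSV columns.
val requiredCols = Array("datetime", "load_datetime", "source_bank",
  "emp_name", "header_row_count", "emp_hours")
val matcher = new ColumnMatch(targetRawData, requiredCols)
println(matcher.validatePresenceOfColumns())

Note that this only validates column presence by name; it does not compare types, so you may want to combine it with a schema comparison like the one in the other answer.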