I'm trying to load data from tab-delimited txt files into Hive using Spark/Scala. Here is the code I'm using:
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object MasterData {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("spark.sql.warehouse.dir", "hdfs://####:8020/warehouse/tablespace/managed/hive")
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._

    val sc = spark.sparkContext
    val lines = sc.textFile("hdfs://####:8020/user/hive/product/masterdata/*.txt")

    // Get table structure from header
    val header = lines.first()
    val schema = StructType(header.split("\t").map(fieldName => StructField(fieldName, StringType, nullable = true)))
    val rdd = lines.filter(x => x != header).map(line => Row.fromSeq(line.split("\t").toSeq))

    // Build DataFrame from RDD
    val df = spark.createDataFrame(rdd, schema)

    // Job seems to complete successfully, but warnings about nesting appear
    df.repartition($"skufirstfour").write.format("parquet").mode("append").insertInto("product.masterdata")

    val dfNew = spark.sql("SELECT * FROM product.masterdata")
    dfNew.show() // Data is empty here
    println("Done!!")
  }
}
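(As an aside, I believe the same DataFrame could also be built with Spark's built-in delimited reader instead of the manual RDD/Row path. This is only a sketch, assuming every input file carries a single header line and that keeping all columns as strings is acceptable:)

// Sketch: build the DataFrame with the built-in CSV reader using a tab separator.
// With header=true, Spark drops the first line of each file and uses it for column names.
// inferSchema is deliberately left off so all columns stay StringType, matching the manual schema above.
val dfAlt = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("hdfs://####:8020/user/hive/product/masterdata/*.txt")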
After creating the DataFrame df, its structure appears correct. However, when writing the data into the table product.masterdata, I get this message from Spark:
19/07/22 10:11:45 INFO FileOperations: Expected level of nesting (2) is not present in skufirstfour=%2F2-EN%2F/part-00094-67cea282-4852-493f-9507-23962e6838ab.c000 (from hdfs://####:8020/warehouse/tablespace/managed/hive/product.db/masterdata/.hive-staging_hive_2019-07-22_10-11-24_973_1641771327816565678-1/-ext-10000/skufirstfour=%2F2-EN%2F/part-00094-67cea282-4852-493f-9507-23962e6838ab.c000)
19/07/22 10:11:45 INFO FileOperations: Expected level of nesting (2) is not present in skufirstfour=%2F3-pk%2F/part-00171-67cea282-4852-493f-9507-23962e6838ab.c000 (from hdfs://####:8020/warehouse/tablespace/managed/hive/product.db/masterdata/.hive-staging_hive_2019-07-22_10-11-24_973_1641771327816565678-1/-ext-10000/skufirstfour=%2F3-pk%2F/part-00171-67cea282-4852-493f-9507-23962e6838ab.c000)
This message seems to repeat for every record I try to insert. (The %2F sequences in the partition directory names are URL-encoded "/" characters, so some skufirstfour values apparently contain slashes.) In the final step I try to read back from the table product.masterdata, but it is empty. Is there a better way to insert data into Hive with Spark, or is there some configuration that would fix this nesting issue?
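For context, this is the kind of alternative write path I have in mind; it is only a sketch (the table name product.masterdata_spark is hypothetical), and it lets Spark create and manage its own partitioned table rather than inserting into the existing Hive-managed one:

// Sketch: have Spark create and manage a partitioned Parquet table directly,
// instead of using insertInto on the existing Hive-managed table.
// "product.masterdata_spark" is a hypothetical table name.
df.write
  .mode("append")
  .partitionBy("skufirstfour")
  .format("parquet")
  .saveAsTable("product.masterdata_spark")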