"Expected level of nesting" error when trying to insert data into a Hive table from Spark / Scala

Asked: 2019-07-22 17:36:06

Tags: scala apache-spark hadoop hive

I'm trying to load data from tab-delimited txt files into Hive using Spark / Scala. This is the code I'm using:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object MasterData {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("Spark Hive Example")
      .config("hive.exec.dynamic.partition", "true")
      .config("hive.exec.dynamic.partition.mode", "nonstrict")
      .config("spark.sql.warehouse.dir", "hdfs://####:8020/warehouse/tablespace/managed/hive")
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._
    import spark.sql

    val sc = spark.sparkContext

    val lines = sc.textFile("hdfs://####:8020/user/hive/product/masterdata/*.txt")

    // Get table structure from header
    val header = lines.first()
    val schema = StructType(header.split("\t").map(fieldName => StructField(fieldName, StringType, nullable = true)))
    val rdd = lines.filter(x => x != header).map(line => Row.fromSeq(line.split("\t").toSeq))

    // Build DataFrame from RDD
    val df = spark.createDataFrame(rdd, schema)

    // Job seems to complete successfully, but warnings about nesting appear
    df.repartition($"skufirstfour").write.format("parquet").mode("append").insertInto("product.masterdata")

    val dfNew = spark.sql("SELECT * FROM product.masterdata")
    dfNew.show() // Data is Empty here

    println("Done!!")
  }
}
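For reference, the header-parsing logic above could also be replaced by Spark's built-in CSV reader, which handles the delimiter and header row directly. A minimal sketch (dfAlt is a name I made up here); schema inference is deliberately left off so every column stays StringType, matching the manual StructType above:

// Equivalent load using the DataFrame reader API:
// "sep" sets the tab delimiter and "header" consumes the first line as column names.
val dfAlt = spark.read
  .option("sep", "\t")
  .option("header", "true")
  .csv("hdfs://####:8020/user/hive/product/masterdata/*.txt")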

After creating the DataFrame df, the structure looks correct. However, when the data is written to the table product.masterdata, I get this message from Spark:

19/07/22 10:11:45 INFO FileOperations: Expected level of nesting (2) is not present in skufirstfour=%2F2-EN%2F/part-00094-67cea282-4852-493f-9507-23962e6838ab.c000 (from hdfs://####:8020/warehouse/tablespace/managed/hive/product.db/masterdata/.hive-staging_hive_2019-07-22_10-11-24_973_1641771327816565678-1/-ext-10000/skufirstfour=%2F2-EN%2F/part-00094-67cea282-4852-493f-9507-23962e6838ab.c000)
19/07/22 10:11:45 INFO FileOperations: Expected level of nesting (2) is not present in skufirstfour=%2F3-pk%2F/part-00171-67cea282-4852-493f-9507-23962e6838ab.c000 (from hdfs://####:8020/warehouse/tablespace/managed/hive/product.db/masterdata/.hive-staging_hive_2019-07-22_10-11-24_973_1641771327816565678-1/-ext-10000/skufirstfour=%2F3-pk%2F/part-00171-67cea282-4852-493f-9507-23962e6838ab.c000)

This message seems to repeat for every record I try to insert. In the last step I try to read back from the table product.masterdata, but it is empty. I notice that my skufirstfour partition values contain slashes (e.g. /2-EN/), which show up URL-encoded as %2F in the staging paths above. Is there a better way to insert data into Hive with Spark, or is there some configuration I can set to resolve this nesting issue?
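If the extra nesting comes from the slashes inside the partition values (the %2F escapes above), one workaround I'm considering is sanitizing the partition column before writing. A rough sketch; the replacement character "-" is an arbitrary choice, and this does change the stored values:

import org.apache.spark.sql.functions.{col, regexp_replace}

// Strip "/" from the partition value so the partition directory name
// contains no characters that Hive has to URL-encode as %2F.
val dfClean = df.withColumn("skufirstfour", regexp_replace(col("skufirstfour"), "/", "-"))
dfClean.repartition(col("skufirstfour")).write.mode("append").insertInto("product.masterdata")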

0 Answers:

No answers yet.