HDP sandbox: IOException when calling saveAsTable on a DataFrame

Date: 2019-02-27 13:31:51

Tags: apache-spark hive sandbox

I am trying to run the example below, which creates a Hive table from a Spark DataFrame. The code works when I call spark-submit with master=local, but it throws an exception when I submit with master=yarn. This is the invocation:

spark-submit --class test.sandbox.HDPRiskFactor --master yarn --name "Risk Factor" ./hdprisk-0.0.1-SNAPSHOT.jar

In addition, I created a table named default.geolocation from the Hive console, but when I call show() I cannot see it from Spark. I also tried setting the executor count to 0 in YARN mode, but that did not work either.

1) Why does the code only work with a local master and not with yarn?
2) Why can't I see the table created in Hive from my Spark code?

def main(args: Array[String]): Unit = {

  val spark = SparkSession.builder().getOrCreate()
  //  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  val sc = spark.sparkContext
  val hadoopconf = new Configuration()
  val hdfs = FileSystem.get(hadoopconf)
  val csvDataDir = "/tmp/data"
  //import spark.implicits._
  val dataList = List(("geolocation", "csv"), ("trucks", "csv"))
  listFiles(this.getClass.getClassLoader.getResource(".").getFile)
  dataList.map(path => {
    val localFile = path._1 + "." + path._2
    val hdfsFile = csvDataDir + "/" + path._1 + "." + path._2
    if (!testDirExist(hdfs, hdfsFile)) copyStreamToHdfs(hdfs, "/root/", csvDataDir, localFile)
  })
  val geoLocationDF = spark.read.format("csv").option("header", "true").load("hdfs:///tmp/data/geolocation.csv")

  // Now that we have the data loaded into a DataFrame, we can register a temporary view.
  spark.sql("SHOW TABLES").show()
  geoLocationDF.write.format("orc").saveAsTable("default.geolocation")
  //  geoLocationDF.createOrReplaceTempView("geolocation")

  spark.sql("select * from default.geolocation").show()
}

1 Answer:

Answer 0 (score: 0)

I had not configured the Hive context correctly, so the files were being written under the root directory. The solution was to pass the appropriate configuration parameters:

val spark = SparkSession.builder()
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
  .config("spark.sql.sources.maxConcurrentWrites","1")
  .config("spark.sql.parquet.compression.codec", "snappy")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("parquet.compression", "SNAPPY")
  .config("hive.exec.max.dynamic.partitions", "3000")
  .config("parquet.enable.dictionary", "false")
  .config("hive.support.concurrency", "true")
  .enableHiveSupport()
  .getOrCreate()
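
With enableHiveSupport() (and hive-site.xml on Spark's classpath, as it normally is on the HDP sandbox), the session uses the Hive metastore as its catalog instead of Spark's default in-memory catalog. That is why saveAsTable now writes into the warehouse directory and why tables created from the Hive console become visible to Spark. A minimal verification sketch, assuming the session built above and the default.geolocation table from the question:

// The table created from the Hive console should now be listed
spark.sql("SHOW TABLES IN default").show()
// And the table written by saveAsTable should be readable back from Spark
spark.table("default.geolocation").show(5)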