Creating a partitioned Hive table from an existing CSV file

Date: 2019-02-12 09:52:39

Tags: hive apache-spark-sql

I am trying to use Spark SQL to load a CSV file as a partitioned Hive table and to start the Thrift server on top of it. This is what I have tried:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SparkSession

def main(args: Array[String]): Unit = {
    val conf = new SparkConf
    conf
      .set("hive.server2.thrift.port", "10000")
      .set("spark.sql.hive.thriftServer.singleSession", "true")
      .set("spark.sql.warehouse.dir", "hdfs://sql/metadata/hive")
      .set("spark.sql.catalogImplementation", "hive")
      .set("skip.header.line.count", "1") // a Hive table property; has no effect as a Spark setting
      .setMaster("local[*]")
      .setAppName("ThriftServer")
    val sc = new SparkContext(conf)
    val spark = SparkSession.builder()
      .config(conf)
      .enableHiveSupport() // required so CREATE TABLE goes through the Hive metastore
      .getOrCreate()

    spark.sql(
      "CREATE TABLE IF NOT EXISTS freq (" +
        "time_stamp bigint, " +
        "time_quality string) " +
        "PARTITIONED BY (id int) " +
        "ROW FORMAT DELIMITED " +
        "FIELDS TERMINATED BY ',' " +
        "STORED AS TEXTFILE " +
        "LOCATION 'Path_to_CSV_file' " +
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )
}
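For context: Hive resolves partitions from the metastore, not by scanning the table location, and it expects one id=<value> subdirectory per partition value under LOCATION. A quick way to see what was actually registered is sketched below (the directory layout in the comments is an assumed example, not the actual data):

    // Hive expects one subdirectory per partition value under the table
    // LOCATION, e.g. (made-up paths):
    //
    //   Path_to_CSV_file/id=1/data.csv
    //   Path_to_CSV_file/id=2/data.csv
    //
    // List the partitions the metastore has registered for `freq`:
    spark.sql("SHOW PARTITIONS freq").show(truncate = false)

If the table was created without a working PARTITIONED BY clause, SHOW PARTITIONS fails outright, which would be consistent with the INSERT error shown further down.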

With the code above, the freq table is created and the data shows up in it, but it is not partitioned by the id column. I also tried altering the table and inserting data against the partition key, but neither worked.

Altering the table:

spark.sql("ALTER TABLE freq ADD PARTITION (id) " +
      "LOCATION 'PATH_TO_CSV_FILE' ")
ERROR: Found an empty partition key 'id'.(line 1, pos 33)

== SQL ==
ALTER TABLE freq ADD PARTITION (id) LOCATION 'Path_To_CSV_File'
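The error is about the syntax: Hive's ADD PARTITION requires a concrete partition spec, not a bare column name. A minimal sketch, assuming the data for id=1 already sits in its own directory (the path is a placeholder), and that freq was really created with PARTITIONED BY (id int):

    // ADD PARTITION needs an explicit value for each partition column:
    spark.sql(
      "ALTER TABLE freq ADD IF NOT EXISTS PARTITION (id=1) " +
      "LOCATION 'PATH_TO_CSV_FILE/id=1'"
    )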

Inserting into the table:

spark.sql(
      "INSERT OVERWRITE TABLE freq PARTITION (id) " +
      "SELECT * " +
      "FROM freq"
    )
ERROR: Exception in thread "main" org.apache.spark.sql.AnalysisException: id is not a valid partition column in table `database`.`freq`.;
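For comparison, a dynamic-partition insert usually has the shape sketched below. It assumes a separate non-partitioned staging table (freq_raw is a hypothetical name, since the statement above reads from and overwrites the same table) and it needs Hive's dynamic-partition settings; note that this route rewrites the rows, i.e. it creates a copy of the data:

    // Allow fully dynamic partition values:
    spark.sql("SET hive.exec.dynamic.partition=true")
    spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
    // The partition column must come last in the SELECT list:
    spark.sql(
      "INSERT OVERWRITE TABLE freq PARTITION (id) " +
      "SELECT time_stamp, time_quality, id FROM freq_raw"
    )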

My first question is: what is the correct way to create a partitioned Hive table with Spark SQL?
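One commonly used route is to skip the DDL and go through the DataFrame API instead: read the CSV, then write it back partitioned. A sketch under that assumption (the path and table name are placeholders); note that this materializes a second, partitioned copy under the warehouse directory:

    // Read the raw CSV; the header is handled here instead of via
    // skip.header.line.count:
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("Path_to_CSV_file")

    // Write it back as a table partitioned by id:
    df.write
      .mode("overwrite")
      .partitionBy("id")
      .format("csv")
      .saveAsTable("freq_partitioned")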

Also, since the data already exists as CSV files, I would like to avoid making a second copy of it (due to capacity constraints); I want to keep exactly one copy, whether as the CSV files or as Hive partition subdirectories. Is that possible?
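A single-copy setup seems possible if the files can be rearranged in place: an HDFS rename into one id=<value> directory per partition is a metadata operation rather than a copy, and an external table plus MSCK REPAIR TABLE would then register those directories as partitions without moving any data into the warehouse. A sketch under those assumptions (all paths are placeholders):

    // Assumed layout after moving (renaming) the existing CSV files:
    //
    //   hdfs://sql/freq_data/id=1/part1.csv
    //   hdfs://sql/freq_data/id=2/part2.csv
    //
    spark.sql(
      "CREATE EXTERNAL TABLE IF NOT EXISTS freq (" +
        "time_stamp bigint, " +
        "time_quality string) " +
        "PARTITIONED BY (id int) " +
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
        "STORED AS TEXTFILE " +
        "LOCATION 'hdfs://sql/freq_data' " +
        "TBLPROPERTIES ('skip.header.line.count'='1')"
    )
    // Scan LOCATION and register every id=<value> directory as a partition:
    spark.sql("MSCK REPAIR TABLE freq")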

0 Answers:

No answers yet