I am trying to use Spark SQL to load a CSV file as a partitioned Hive table and to start the Thrift server. This is what I have tried:
def main(args: Array[String]): Unit = {
  val conf = new SparkConf
  conf
    .set("hive.server2.thrift.port", "10000")
    .set("spark.sql.hive.thriftServer.singleSession", "true")
    .set("spark.sql.warehouse.dir", "hdfs://sql/metadata/hive")
    .set("spark.sql.catalogImplementation", "hive")
    .set("skip.header.line.count", "1")
    .setMaster("local[*]")
    .setAppName("ThriftServer")
  val sc = new SparkContext(conf)
  val spark = SparkSession.builder()
    .config(conf)
    .enableHiveSupport()
    .getOrCreate()
  spark.sql(
    "CREATE TABLE IF NOT EXISTS freq (" +
      "time_stamp bigint," +
      "time_quality string) " +
      "PARTITIONED BY (id int) " +
      "ROW FORMAT DELIMITED " +
      "FIELDS TERMINATED BY ',' " +
      "STORED AS TEXTFILE " +
      "LOCATION 'Path_to_CSV_file' " +
      "TBLPROPERTIES('skip.header.line.count'='1')"
  )
With the code above, the freq table is created and the data is loaded into it, but it is not partitioned by the id column. I have also tried altering the table and inserting the data based on the partition key, but without success.
Altering the table:
spark.sql("ALTER TABLE freq ADD PARTITION (id) " +
"LOCATION 'PATH_TO_CSV_FILE' ")
ERROR: Found an empty partition key 'id'.(line 1, pos 33)
== SQL ==
ALTER TABLE freq ADD PARTITION (id) LOCATION 'Path_To_CSV_File'
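From the error, it looks like ADD PARTITION expects a concrete partition value rather than a bare column name. A sketch of what I understand the syntax to be (the value 1 and the location placeholder are just examples):

```scala
// Assumption: ADD PARTITION takes a literal partition value, not a bare
// column name. 'PATH_TO_CSV_FILE' is still a placeholder for the real path.
spark.sql(
  "ALTER TABLE freq ADD PARTITION (id = 1) " +
  "LOCATION 'PATH_TO_CSV_FILE'"
)
```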
Inserting into the table:
spark.sql(
"INSERT OVERWRITE TABLE freq PARTITION (id) " +
"SELECT * " +
"FROM freq"
)
ERROR: Exception in thread "main" org.apache.spark.sql.AnalysisException: id is not a valid partition column in table `database`.`freq`.;
My first question is: what is the correct way to create a partitioned table with Spark SQL and Hive?
Also, since the data is already stored in CSV files, I do not want to create a second copy of it (due to capacity constraints); I want to keep only one copy, whether as the CSV files themselves or as Hive partition subdirectories. Is that possible?
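For context on the second question: the approach I have seen suggested elsewhere is an external staging table over the CSV directory plus a dynamic-partition insert, but that writes a second, partitioned copy of the data, which is exactly what I am trying to avoid. A sketch of that approach, assuming the staging table name (freq_staging) and the paths are placeholders:

```scala
// Dynamic-partition insert via a staging table (a sketch; this duplicates
// the data into Hive partition subdirectories).
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

// External table over the existing CSV files; no data is copied yet.
spark.sql(
  "CREATE EXTERNAL TABLE IF NOT EXISTS freq_staging (" +
    "time_stamp bigint, time_quality string, id int) " +
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' " +
    "STORED AS TEXTFILE " +
    "LOCATION 'Path_to_CSV_file' " +
    "TBLPROPERTIES('skip.header.line.count'='1')"
)

// The partition column must come last in the SELECT list; this step
// rewrites the rows into the partitioned layout of freq.
spark.sql(
  "INSERT OVERWRITE TABLE freq PARTITION (id) " +
  "SELECT time_stamp, time_quality, id FROM freq_staging"
)
```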