I created a Hive external table, stored as text files and partitioned by event_date Date.
When reading the Hive table from Spark, how do we tell Spark the specific CSV format (the field delimiter) of the underlying files?
Environment
1. Spark 1.5.0 - CDH 5.5.1, using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
2. Hive 1.1, CDH 5.5.1
Scala script
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
val distData_1 = distData.withColumn("event_date", current_date())
distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]
scala> distData_1.show
+---+---+---+----------+
| _1| _2| _3|event_date|
+---+---+---+----------+
|  1|  1|  1|2016-03-25|
|  2|  2|  2|2016-03-25|
|  3|  3|  3|2016-03-25|
+---+---+---+----------+
distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")
scala> sqlContext.sql("select * from part_table").show
+-----+----+----+----------+
|    a|   b|   c|event_date|
+-----+----+----+----------+
|1,1,1|null|null|2016-03-25|
|2,2,2|null|null|2016-03-25|
|3,3,3|null|null|2016-03-25|
+-----+----+----+----------+
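As a point of comparison, the delimited files can be read back correctly when the delimiter is given explicitly. A minimal workaround sketch, assuming the com.databricks:spark-csv_2.10 package is on the classpath (the path is the partition directory shown under "Looking at HDFS" below):

// Workaround sketch: bypass the Hive table and read the partition's
// text files directly, giving Spark the comma delimiter explicitly.
val raw = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("/user/hdfs/hive/part_table/event_date=2016-03-25")
raw.show

This sidesteps the Hive SerDe entirely, so it does not answer how to make the Hive table itself readable from Spark, but it confirms the data on disk is fine.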
Hive table
create external table part_table (a String, b int, c bigint)
partitioned by (event_date Date)
row format delimited fields terminated by ','
stored as textfile LOCATION "/user/hdfs/hive/part_table";
Running select * from part_table in Hive shows:
|part_table.a |part_table.b |part_table.c |part_table.event_date |
|1            |1            |1            |2016-03-25            |
|2            |2            |2            |2016-03-25            |
|3            |3            |3            |2016-03-25            |
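To rule out a partition-registration problem, the partition metadata can also be checked from Spark (a minimal sketch):

// Sanity check: confirm the partition Spark wrote is registered.
sqlContext.sql("show partitions part_table").show
// expected to list: event_date=2016-03-25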
Looking at HDFS
The path /user/hdfs/hive/part_table/event_date=2016-03-25 contains 2 part files:
part-00000
part-00001
part-00000 content
1,1,1
part-00001 content
2,2,2
3,3,3
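As a further sanity check (a minimal sketch; the path is the partition directory above), the raw lines can be read back as plain text:

// Read the partition directory as plain text to confirm the rows
// really are comma-delimited on disk.
sc.textFile("/user/hdfs/hive/part_table/event_date=2016-03-25")
  .collect()
  .foreach(println)
// prints: 1,1,1  2,2,2  3,3,3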
P.S. If we store the table as ORC, it writes and reads the data as expected.
If "fields terminated by" is left at its default, Spark reads the data as expected too, so I would guess this is a bug.
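For comparison, a minimal sketch of the ORC variant mentioned above (the table name part_table_orc is hypothetical):

// Hypothetical ORC variant: same write, but stored as ORC.
// In our environment this writes and reads back as expected.
distData_1.write
  .format("orc")
  .mode("append")
  .partitionBy("event_date")
  .saveAsTable("part_table_orc")
sqlContext.sql("select * from part_table_orc").show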