I have an external partitioned Hive table whose underlying files use ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'. Reading the data directly through Hive works fine, but when I use Spark's DataFrame API the delimiter '|' is not taken into account.
Create the external partitioned table:
hive> create external table external_delimited_table(value1 string, value2 string)
partitioned by (year string, month string, day string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
location '/client/edb/poc_database/external_delimited_table';
Create a data file containing a single row and place it in the external partitioned table's location:
shell>echo "one|two" >> table_data.csv
shell>hadoop fs -mkdir -p /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20
shell>hadoop fs -copyFromLocal table_data.csv /client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20
Activate the partition:
hive> alter table external_delimited_table add partition (year='2016',month='08',day='20');
Sanity check:
hive> select * from external_delimited_table;
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
| external_delimited_table.value1  | external_delimited_table.value2  | external_delimited_table.year  | external_delimited_table.month  | external_delimited_table.day  |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
| one                              | two                              | 2016                           | 08                              | 20                            |
+----------------------------------+----------------------------------+--------------------------------+---------------------------------+-------------------------------+--+
Spark code:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkContext, SparkConf}
object TestHiveContext {

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Test Hive Context")
    val spark = new SparkContext(conf)
    val hiveContext = new HiveContext(spark)

    // Query the external table through the Hive metastore
    val dataFrame: DataFrame = hiveContext.sql("SELECT * FROM external_delimited_table")
    dataFrame.show()

    spark.stop()
  }
}
dataFrame.show() output:
+-------+------+----+-----+---+
| value1|value2|year|month|day|
+-------+------+----+-----+---+
|one|two|  null|2016|   08| 20|
+-------+------+----+-----+---+
Answer 0 (score: 2):
This is a problem with Spark version 1.5.0. In version 1.6.0 the problem does not occur:
scala> sqlContext.sql("select * from external_delimited_table")
res2: org.apache.spark.sql.DataFrame = [value1: string, value2: string, year: string, month: string, day: string]
scala> res2.show
+------+------+----+-----+---+
|value1|value2|year|month|day|
+------+------+----+-----+---+
| one| two|2016| 08| 20|
+------+------+----+-----+---+
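If upgrading to 1.6.0 is not an option, one possible workaround on 1.5.0 (a minimal sketch, not from the original answer, assuming the partition path and two-column layout shown in the question) is to bypass the Hive text reader and split the delimited lines manually:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.{SparkConf, SparkContext}

object DelimitedWorkaround {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Delimited Workaround")
    val sc = new SparkContext(conf)
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Hypothetical workaround: read the raw partition files and split on '|' ourselves,
    // instead of relying on Hive's delimited-text handling in Spark 1.5.0.
    val raw = sc.textFile("/client/edb/poc_database/external_delimited_table/year=2016/month=08/day=20")

    val df = raw
      .map(_.split("\\|"))        // "one|two" -> Array("one", "two")
      .map(a => (a(0), a(1)))
      .toDF("value1", "value2")

    df.show()
    sc.stop()
  }
}

Note that the partition columns (year, month, day) are not stored in the file contents, so with this approach they would have to be added back separately, for example by parsing them out of the input path.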