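I created an external table in Hive and then placed a CSV file at the HDFS location the table points to. When I check the table in Hue, the output is formatted correctly, but when I read the same table with Spark, the first row of the DataFrame is identical to the header, i.e. the header appears twice.
CDH version: Hive 1.1.0-cdh5.13.1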
DDL
CREATE EXTERNAL TABLE `dummy`(
name string,
age string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/tmp/dummy'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'numFiles'='1',
'numRows'='-1',
'rawDataSize'='-1',
'skip.header.line.count'='1')
CSV
name,age
abc,10
Hue output
+----+-----+
|name| age |
+----+-----+
|abc | 10  |
+----+-----+
Spark output
sparkSession.table('dummy').show()
+----+-----+
|name| age |
+----+-----+
|name| age |
+----+-----+
|abc | 10  |
+----+-----+
Expected output from Spark
+----+-----+
|name| age |
+----+-----+
|abc | 10  |
+----+-----+
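The symptom suggests that the Spark read path is not applying the table's 'skip.header.line.count' property, although that is an inference from the output rather than something confirmed here. Under that assumption, a possible workaround is to drop rows that merely repeat the column names after loading the table. The sketch below is illustrative only: the helper names is_header_row and drop_header_rows are hypothetical, and spark is assumed to be the active SparkSession.

```python
def is_header_row(values, columns):
    """Return True when every value in the row equals its own column name."""
    return tuple(values) == tuple(columns)


def drop_header_rows(spark, df):
    """Return a copy of df without rows that merely repeat the header.

    `spark` is assumed to be the active SparkSession and `df` a DataFrame
    such as spark.table("dummy"); both helper names are illustrative.
    """
    columns = list(df.columns)
    # Row objects are tuples, so they can be compared with the column names.
    kept = df.rdd.filter(lambda row: not is_header_row(row, columns))
    # Rebuild a DataFrame with the original schema from the filtered rows.
    return spark.createDataFrame(kept, df.schema)
```

It could be called as drop_header_rows(sparkSession, sparkSession.table('dummy')).show(). Note that the comparison is an exact, case-sensitive match, and that a genuine data row whose values happen to equal the column names would be removed as well.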