Question

I am trying to read CSV data from an S3 bucket and create a table in AWS Athena. The table I create does not skip the header row of my CSV file.

Query example:

CREATE EXTERNAL TABLE IF NOT EXISTS table_name (
  `event_type_id` string,
  `customer_id` string,
  `date` string,
  `email` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "|",
  "quoteChar" = "\""
)
LOCATION 's3://location/'
TBLPROPERTIES ("skip.header.line.count"="1");

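I tried skip.header.line.count, but it does not seem to work: the header row still shows up in the table. I think AWS has an issue here. Is there another way to work around this?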

Answer 1

This works for Redshift:

You want to use table properties ('skip.header.line.count'='1'). You can also set other properties if you like, such as 'numRows'='100'. Here is an example:

create external table exreddb1.test_table
(ID BIGINT 
,NAME VARCHAR
)
row format delimited
fields terminated by ','
stored as textfile
location 's3://mybucket/myfolder/'
table properties ('numRows'='100', 'skip.header.line.count'='1');

Answer 2

This is a known deficiency.

The best approach I have seen was tweeted by Eric Hammond:

...WHERE date NOT LIKE '#%'

This appears to skip the header row at query time. I am not sure how it works, but it may be a way of skipping NULLs.
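A related way to filter at query time is to exclude the row whose value equals the column's own header text. The following is only a sketch against the table from the question; it assumes the header line of each CSV file contains the column names verbatim (so the header row carries the literal string 'event_type_id' in that column), which the question does not confirm.

```sql
-- Sketch: drop the header row at query time instead of at table level.
-- Assumption: the CSV header line holds the column names verbatim.
SELECT event_type_id,
       customer_id,
       "date",
       email
FROM table_name
WHERE event_type_id <> 'event_type_id'  -- excludes the header row
   OR event_type_id IS NULL;            -- keeps data rows where the value is NULL
```

The IS NULL branch is there because a plain <> comparison would also discard rows where the column is NULL. The filter has to be repeated in every query (or wrapped in a view), so it is a workaround rather than a fix.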

Answer 3

As of today (2019-11-18), the query from the OP seems to work. That is, skip.header.line.count is accepted and the first line is indeed skipped.
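One way to check whether the header is really being skipped on your own table is to count rows that still look like a header. This is a sketch under the same assumption as the question's layout, namely that the header line contains the column names verbatim.

```sql
-- Sketch: count rows that still look like a header row.
-- Assumption: the CSV header line holds the column names verbatim.
SELECT count(*) AS header_rows
FROM table_name
WHERE event_type_id = 'event_type_id';
```

If skip.header.line.count is taking effect, this count should be zero. A nonzero value suggests the header lines are still being read as data.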

How to skip headers when reading data from CSV files in S3 and creating a table in AWS Athena.
