I have created Parquet files, and now I'm trying to load them into an Impala table.
I created the table like this:
CREATE EXTERNAL TABLE `user_daily` (
  `user_id` BIGINT COMMENT 'User ID',
  `master_id` BIGINT,
  `walletAgency` BOOLEAN,
  `zone_id` BIGINT COMMENT 'Zone ID',
  `day` STRING COMMENT 'The stats are aggregated for single days',
  `clicks` BIGINT COMMENT 'The number of clicks',
  `impressions` BIGINT COMMENT 'The number of impressions',
  `avg_position` BIGINT COMMENT 'The average position * 100',
  `money` BIGINT COMMENT 'The cost of the clicks, in hellers',
  `web_id` BIGINT COMMENT 'Web ID',
  `discarded_clicks` BIGINT COMMENT 'Number of discarded clicks from column "clicks"',
  `impression_money` BIGINT COMMENT 'The cost of the impressions, in hellers'
)
PARTITIONED BY (
  year BIGINT,
  month BIGINT
)
STORED AS PARQUET
LOCATION '/warehouse/impala/contextstat.db/user_daily/';
Then I copy files with this schema into that location:
parquet-tools schema user_daily/year\=2016/month\=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet
message spark_schema {
  optional int32 user_id;
  optional int32 web_id (INT_16);
  optional int32 zone_id;
  required int32 master_id;
  required boolean walletagency;
  optional int64 impressions;
  optional int64 clicks;
  optional int64 money;
  optional int64 avg_position;
  optional double impression_money;
  required binary day (UTF8);
}
Then when I try to view the entries with
SELECT * FROM user_daily;
I get:
File 'hdfs://.../warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00000-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet'
has an incompatible Parquet schema for column 'contextstat.user_daily.user_id'.
Column type: BIGINT, Parquet schema:
optional int32 user_id [i:0 d:1 r:0]
Do you know how to fix this? I thought BIGINT was the same as int_32. Should I change the table schema, or the way the Parquet files are generated?
Answer 0 (score: 2)
BIGINT is int64, which is why it complains. But you don't have to work out the matching types yourself; Impala can do that for you. Just use the CREATE TABLE LIKE PARQUET variant:
The variant CREATE TABLE ... LIKE PARQUET 'hdfs_path_of_parquet_file' lets you skip the column definitions of the CREATE TABLE statement. The column names and data types are automatically configured based on the organization of the specified Parquet data file, which must already reside in HDFS.
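For this table, a minimal untested sketch of what that could look like (the Parquet path is assembled from the file and location shown above and must point to a real data file in HDFS; the old table definition would have to be dropped or a different name used):

-- Let Impala derive column names and types from an existing data file,
-- keeping the same partitioning and location as before.
CREATE EXTERNAL TABLE `user_daily`
LIKE PARQUET '/warehouse/impala/contextstat.db/user_daily/year=2016/month=8/part-r-00001-fd77e1cd-c824-4ebd-9328-0aca5a168d11.snappy.parquet'
PARTITIONED BY (
  year BIGINT,
  month BIGINT
)
STORED AS PARQUET
LOCATION '/warehouse/impala/contextstat.db/user_daily/';

After creating the table you would still need to register the existing year=/month= directories, for example with ALTER TABLE user_daily RECOVER PARTITIONS; (or ALTER TABLE ... ADD PARTITION for each partition).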
Answer 1 (score: 0)
I used CAST(... AS BIGINT), which changes the Parquet schema from int32 to int64. Then I had to reorder the columns, because they are not matched up by name. After that it worked.
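The answer doesn't show the job that writes the Parquet files, so this is only a rough sketch of the idea, assuming the data comes from some staging source (daily_stats_source is a placeholder name): cast the narrow integer columns to BIGINT and emit the columns in exactly the order the Impala table declares them (discarded_clicks is omitted here because it does not appear in the Parquet schema shown above):

-- Hypothetical SELECT used when regenerating the Parquet files.
-- Integer columns are cast to BIGINT and listed in the same order as the Impala DDL.
SELECT
  CAST(user_id   AS BIGINT)         AS user_id,
  CAST(master_id AS BIGINT)         AS master_id,
  walletagency                      AS walletAgency,
  CAST(zone_id   AS BIGINT)         AS zone_id,
  day,
  clicks,
  impressions,
  avg_position,
  money,
  CAST(web_id    AS BIGINT)         AS web_id,
  CAST(impression_money AS BIGINT)  AS impression_money
FROM daily_stats_source;

Note that this is purely illustrative; in particular, casting impression_money (a double in the file schema) to BIGINT drops any fractional part, so whether that cast is appropriate depends on how the value is stored.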