I'm new to Hive and Hadoop. I created an external table that points to a specific location containing the files.
CREATE DATABASE IF NOT EXISTS <dbname>
LOCATION '/user/<username>/hive/<dbname>.db';
USE <dbname>;
CREATE EXTERNAL TABLE IF NOT EXISTS my_table (json STRING)
PARTITIONED BY (year INT, month INT, day INT)
STORED AS Parquet
LOCATION '/my-data/my/files';
This table has four columns: year, month, day, and json (the first three are the partition columns).
The json looks like this:
{
"t_id":"user.login",
"e_time":"2014-11-30T23:59:52Z",
"user_email_address":"someemail@email.com",
"la_id":"10",
"dbnum":16,
"remote_ip":"171.154.1.8",
"server_name":"some.server",
"protocol":"IMAPS",
"secure":true,
"result":"success"
}
A basic query that works looks like this:
SELECT json FROM my_table WHERE year=2015 AND month=12 LIMIT 10;
What I want is a WHERE clause that filters on the json fields listed above. I imagine it would look something like the following, but it doesn't work:
SELECT get_json_object(mytable.json, '$.t_id') as whatever
FROM mytable
WHERE year=2015 AND month=12 AND json like '%user.login%' LIMIT 1;
Or, better still, to be able to query on the json fields directly, like this:
SELECT COUNT(*)
FROM mytable
WHERE json.t_id = 'user.login'
AND json.someDate > ... and so on...
Any suggestions are appreciated.
Answer 0 (score: 1)
Try this query:
select b.t_id from my_table a lateral view json_tuple(a.json,'t_id') b as t_id where a.year=2015 and a.month=12 LIMIT 10;
You can pass another key to json_tuple and use it in the WHERE clause. For example:
select b.t_id from my_table a lateral view json_tuple(a.json,'t_id','result') b as t_id, result where a.year=2015 and a.month=12 and b.result ='success' LIMIT 10;
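If you would rather not use a lateral view, get_json_object can also be called directly in the WHERE clause, which is close to the COUNT(*) query in the question. The following is only a sketch, assuming the my_table definition from the question; the partition values and the date bound are made-up illustration values. It also assumes every e_time is an ISO 8601 UTC string in the same format as the sample, so that a plain string comparison orders the values chronologically.

```sql
-- Sketch: filter on JSON fields with get_json_object (returns a STRING or NULL).
-- Assumes the my_table definition from the question; the year/month values
-- and the date bound below are hypothetical.
SELECT COUNT(*)
FROM my_table
WHERE year = 2014 AND month = 11
  AND get_json_object(json, '$.t_id') = 'user.login'
  -- String comparison; only chronological if all e_time values share
  -- the same ISO 8601 UTC format as the sample record.
  AND get_json_object(json, '$.e_time') > '2014-11-30T00:00:00Z';
```

Keeping the partition columns in the WHERE clause still matters here, because the JSON string has to be parsed for every row that is scanned.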
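Answer 1 (score: 0)
You need a JSON SerDe so that Hive reads the data as JSON. You can create the table with a JSON row format and then query it like an ordinary table.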
-- Add jar file using "add jar /path-to/hive-json-serde-0.2.jar"
CREATE EXTERNAL TABLE states_json (state_short_name string, state_full_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
LOCATION '/user/hduser/states.json';
The data in states.json looks like {"state_short_name":"CA","state_full_name":"California"}
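Once the table exists, the JSON keys behave like regular columns. A minimal sketch, assuming the SerDe jar has already been added in the current session and that the file holds one JSON object per line:

```sql
-- Sketch: query the SerDe-backed table like any other table.
-- Assumes the states_json table defined above and that the SerDe jar
-- has been added in this session.
SELECT state_full_name
FROM states_json
WHERE state_short_name = 'CA';
```

Note that this approach applies to data stored as JSON text files. The table in the question is STORED AS Parquet with the JSON held in a single string column, so it may not carry over without converting the data.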