I have a file containing lines like these:
232404812.913232|1248|ip:tcp:jxta
232404812.913238|66|ip:udp:data
232404812.913615|98|ip:udp:l2tp:ppp:ip:tcp
I ran the following HiveQL commands:
CREATE EXTERNAL TABLE b_packet (timestamp string, packet_length int, protocol string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "|"
LOCATION 's3://b-file/input/';
CREATE EXTERNAL TABLE b_packet_out (protocol string, cnt int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
LOCATION 's3://b-file/output/1/';
INSERT OVERWRITE TABLE b_packet_out SELECT 'overall',
COUNT(*) FROM b_packet;
INSERT INTO TABLE b_packet_out SELECT 'tcp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:tcp';
INSERT INTO TABLE b_packet_out SELECT 'udp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:udp';
INSERT INTO TABLE b_packet_out SELECT 'icmp',
COUNT(*) FROM b_packet WHERE protocol REGEXP '^ip:icmp';
This gives me the following in the output table.
hive> select * from b_packet_out;
OK
udp 2241
overall 10000
icmp 64
tcp 7633
Is there a more elegant way to write these HiveQL queries, so that I get the same output with fewer lines?
Answer 0 (score: 0)
select
count(*) as overall,
sum(if(protocol rlike '^ip:tcp', 1, 0)) as tcp,
sum(if(protocol rlike '^ip:udp', 1, 0)) as udp,
sum(if(protocol rlike '^ip:icmp', 1, 0)) as icmp
from b_packet
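This produces the same counts in a single pass over the data, returned as four columns of one row rather than as four rows.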
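If the counts need to land in b_packet_out as one row per protocol, as in the question, the single-row result can be unpivoted. The following is a minimal sketch, assuming a Hive version that supports LATERAL VIEW with explode() on a map; the aliases s and t are illustrative and not part of the original schema.

```sql
-- Aggregate once, then turn the four columns into four (protocol, cnt) rows.
INSERT OVERWRITE TABLE b_packet_out
SELECT t.protocol, CAST(t.cnt AS INT) AS cnt
FROM (
  -- Single pass over b_packet: one conditional sum per protocol.
  SELECT
    COUNT(*) AS overall,
    SUM(IF(protocol RLIKE '^ip:tcp', 1, 0)) AS tcp,
    SUM(IF(protocol RLIKE '^ip:udp', 1, 0)) AS udp,
    SUM(IF(protocol RLIKE '^ip:icmp', 1, 0)) AS icmp
  FROM b_packet
) s
-- explode() on a map emits one row per key/value pair.
LATERAL VIEW explode(map(
  'overall', s.overall,
  'tcp', s.tcp,
  'udp', s.udp,
  'icmp', s.icmp
)) t AS protocol, cnt;
```

This replaces the four INSERT statements with one, at the cost of a slightly denser query.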
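If you have more protocols, you could also write select split(protocol,':')[1], count(*) from b_packet group by split(protocol,':')[1], but that will not give you the overall count.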
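One possible way to recover the overall count with the split() approach is WITH ROLLUP, which adds a grand-total row whose grouping column is NULL. This is a sketch under assumptions: the alias proto and the subquery alias p are illustrative, and relabeling NULL as 'overall' is only safe if every protocol value has at least two ':'-separated parts, since split(...)[1] is NULL otherwise and such rows would also be labeled 'overall'.

```sql
-- Count every second-level protocol and add a grand-total row.
SELECT COALESCE(proto, 'overall') AS protocol,
       COUNT(*) AS cnt
FROM (
  -- 'ip:udp:l2tp:ppp:ip:tcp' becomes 'udp', 'ip:tcp:jxta' becomes 'tcp'.
  SELECT split(protocol, ':')[1] AS proto
  FROM b_packet
) p
GROUP BY proto WITH ROLLUP;
```

Unlike the fixed tcp/udp/icmp list, this returns a row for every second-level protocol present in the data.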
Answer 1 (score: 0)
Here is a different solution, but it makes multiple passes over the data and does not really save you any lines of code:
SELECT CASE WHEN GROUPING__ID = 0 THEN 'overall' ELSE
CASE WHEN protocol LIKE 'ip:tcp%' THEN 'tcp'
WHEN protocol LIKE 'ip:udp%' THEN 'udp'
WHEN protocol LIKE 'ip:icmp%' THEN 'icmp' END END AS protocol
, COUNT(1) AS cnt
FROM b_packet
GROUP BY CASE WHEN protocol LIKE 'ip:tcp%' THEN 'tcp'
WHEN protocol LIKE 'ip:udp%' THEN 'udp'
WHEN protocol LIKE 'ip:icmp%' THEN 'icmp' END
GROUPING SETS (
(CASE WHEN protocol LIKE 'ip:tcp%' THEN 'tcp'
WHEN protocol LIKE 'ip:udp%' THEN 'udp'
WHEN protocol LIKE 'ip:icmp%' THEN 'icmp' END)
, ()
)