I'm new to HiveQL and I'm a bit stuck :S
I have a table with the following schema: one column named res, and three partitions under the partition column named field.
create table results( res string) PARTITIONED BY (field STRING);
I then imported data into this table:
insert overwrite table results PARTITION (field= 'title') SELECT explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT explode(line) AS myNewCol FROM albums;
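I am trying to count the unique tuples across the three partitions.
For example, this command counts how many times each title occurs in the dataset: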
SELECT res, count(1) AS counttotal FROM results where field='title' GROUP BY res ORDER BY counttotal;
and it outputs something like:
title                   count
Hit me Baby More time   9
How can I extend this to the tuple (title, album, artist)? That is, what if I want output like the following:
title                album                       artist          count
Baby one more time   hit me baby one more time   britney spears  9
My entire code:
CREATE EXTERNAL TABLE IF NOT EXISTS hivetesttable (
xmldata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/user/sdasd/hivetestdata/';
create view xmlout(line) as select * from hivetesttable;
CREATE VIEW TITLES(line) as select xpath(line,'/MC/SC/*/@ttl') from xmlout;
CREATE VIEW ARTIST(line) as select xpath(line,'/MC/SC/*/@art') from xmlout;
CREATE VIEW ALBUMS(line) as select xpath(line,'/MC/SC/*/@alb') from xmlout;
create table results( res string) PARTITIONED BY (field STRING);
insert overwrite table results PARTITION (field= 'title') SELECT explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT explode(line) AS myNewCol FROM albums;
SELECT res, count(1) AS counttotal FROM results where field='title' GROUP BY res ORDER BY counttotal;
One line of the XML data looks like:
<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="<unknown>" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="<unknown>" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="Whatcha Say"/><S uid="38" gen="" yr="2010" art="Jason Derulo" cmp="<unknown>" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="In My Head"/><S uid="39" gen="" yr="2011" art="Alexandra Stan" cmp="<unknown>" fld="/mnt/sdcard/Music/Alexandra Stan/Mr_ Saxobeat - Single" alb="Mr. Saxobeat - Single" ttl="Mr. Saxobeat (Extended Version)"/><S uid="40" gen="" yr="2011" art="Bushido" cmp="<unknown>" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Wie ein Löwe"/><S uid="41" gen="" yr="2011" art="Bushido" cmp="<unknown>" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Verreckt"/><S uid="42" gen="" yr="2011" art="Lucenzo" cmp="<unknown>" fld="/mnt/sdcard/Music/Lucenzo/Danza Kuduro (feat_ Don Omar) [From _Fast & Furious 5_] - Single" alb="Danza Kuduro (feat. Don Omar) [From "Fast & Furious 5"] - Single" ttl="Danza Kuduro (feat. Don Omar) [From "Fast & Furious 5"]"/><S uid="121" gen="" yr="701" art="Michael Jackson" cmp="<unknown>" fld="/mnt/sdcard/external_sd/Music/Michael Jackson/Bad [Bonus Tracks]" alb="Bad [Bonus Tracks]" ttl="Voice-Over Intro/Quincy Jones Interview #1 [*]"/></SC><PC/></MC>
Answer 0 (score: 1)
With the information you have provided, you cannot get the output you want. Right now you have a table that looks like this:
res                         field
---                         -----
baby one more time          title
baby one more time          title
baby one more time          title
baby one more time          title
baby one more time          title
baby one more time          title
baby one more time          title
baby one more time          title
baby one more time          title
hit me baby one more time   album
hit me baby one more time   album
hit me baby one more time   album
hit me baby one more time   album
hit me baby one more time   album
hit me baby one more time   album
hit me baby one more time   album
hit me baby one more time   album
hit me baby one more time   album
britney spears              artist
britney spears              artist
britney spears              artist
britney spears              artist
britney spears              artist
britney spears              artist
britney spears              artist
britney spears              artist
britney spears              artist
the distance                title
the distance                title
open book                   title
daria                       title
fashion nugget              album
fashion nugget              album
fashion nugget              album
fashion nugget              album
cake                        artist
cake                        artist
cake                        artist
cake                        artist
Because you partitioned it, Hive stores it in three different folders, but that does not affect query results. I added a few extra tracks; with them, I imagine the output you want is this (correct me if I'm wrong):
title                album                       artist          count
baby one more time   hit me baby one more time   britney spears  9
the distance         fashion nugget              cake            2
open book            fashion nugget              cake            1
daria                fashion nugget              cake            1
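But there is no way to tell that "open book" has anything to do with "fashion nugget" or "cake", just as there is no way to tell that "baby one more time" is related to "britney spears". You could try matching on the counts, but you would end up with something like this: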
title                album                       artist          count
baby one more time   hit me baby one more time   britney spears  9
null                 fashion nugget              cake            4
the distance         null                        null            2
open book,daria      null                        null            1
I think what you want is a table with columns like this:
title                album                       artist
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
baby one more time   hit me baby one more time   britney spears
the distance         fashion nugget              cake
the distance         fashion nugget              cake
open book            fashion nugget              cake
daria                fashion nugget              cake
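possibly still partitioned by artist and/or album. Whether or not the table is partitioned, you can write queries as if it were not (as long as the data is not corrupted, partitioning affects only performance, not results). It does, however, affect how you create and populate the table. Let me know if this is what you want and I will edit this answer to address that question instead.
Edit to answer:
OK, creating the table without any partitions is simple:
CREATE TABLE results (title string, album string, artist string);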
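Creating the table with partitions is almost as simple; you just need to decide first what to partition on. If you partition on artist, you can run queries specific to a single artist or a set of artists without processing the data for any other artist. If you partition on artist and album, you can do the same for albums as well. The price is that large files get split into smaller ones, and MapReduce (and therefore Hive) generally handles large files better. I would not worry about partitioning at all unless you are working with at least tens of GB and feel you have a handle on how partitions and HiveQL work. For completeness, though, partitioned by artist: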
CREATE TABLE results (title string, album string) PARTITIONED BY (artist string);
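To illustrate the point about restricting a query to one artist, here is a minimal sketch against the artist-partitioned table defined just above. The value 'cake' is only sample data from this answer. Because artist is the partition column, a filter on it should let Hive read only that partition's directory, while the query itself is written exactly as if artist were an ordinary column.

```sql
-- Counts tracks per (title, album) for a single artist.
-- artist is the partition column, so this filter allows partition pruning.
SELECT title, album, COUNT(*) AS counttotal
FROM results
WHERE artist = 'cake'
GROUP BY title, album;
```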
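Partitioned by artist, then by album. Partitioning by (artist string, album string) vs (album string, artist string) will not change your results, but you should put the logical top of the hierarchy first:
CREATE TABLE results (title string) PARTITIONED BY (artist string, album string);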
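If the only information we have access to comes from the tables titles, artists, and albums, then populating this table is not easy: we have a big pile of titles, artists and albums, but no way of telling which title goes with which album, for example. I hope you still have some data in which these relationships are intact, such as your original dataset. Without knowing the form of that hypothetical data, I can't say how to populate the table. But if you do go with a partitioned table, this answer may be useful to you if you don't want to specify every artist and album by hand (since each artist gets its own partition and, within it, each album gets its own partition).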
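One way to avoid naming every partition by hand is a dynamic-partition insert. The sketch below assumes a hypothetical unpartitioned staging table, staging_tracks, that already holds matching title, album and artist values per row; neither that table nor its columns come from the question. The target is the table partitioned by (artist, album).

```sql
-- Allow Hive to create partitions from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- staging_tracks is a hypothetical source table (title, album, artist).
-- The partition columns must come last in the SELECT list,
-- in the same order as in the PARTITION clause.
INSERT OVERWRITE TABLE results PARTITION (artist, album)
SELECT title, artist, album
FROM staging_tracks;
```

With many distinct artists and albums this creates a large number of small partitions, which is the cost described earlier.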
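Edit: The asker has XML files in which the title, album, artist relationships are intact. See the comments for more on this.
Now the crux of the question is counting the distinct tuples. This works the same way no matter how the data is partitioned (if at all). We do it with the GROUP BY clause. When you specify one column (or partition, which you can think of as a column with special properties), you break the data into groups, one per distinct value of that column. If you specify several columns, the data is broken into groups that have distinct combinations of values across those columns. That is what we exploit to count distinct tuples: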
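Since the XML keeps each track's attributes together, one possible way to fill the unpartitioned results (title, album, artist) table is sketched below. It is not taken from the comments. It assumes every S element carries all three attributes (ttl, alb, art), so that the three arrays returned by xpath line up by position, and it assumes a Hive version recent enough to provide posexplode and to accept a non-constant array index; older releases do not.

```sql
-- Extract the three attribute arrays per XML row, then walk the title
-- array by position and pick the album and artist at the same index.
-- Assumption: every <S> element has ttl, alb and art, so the arrays align.
INSERT OVERWRITE TABLE results
SELECT t.ttl        AS title,
       x.albs[t.pos] AS album,
       x.arts[t.pos] AS artist
FROM (
  SELECT xpath(xmldata, '/MC/SC/*/@ttl') AS ttls,
         xpath(xmldata, '/MC/SC/*/@alb') AS albs,
         xpath(xmldata, '/MC/SC/*/@art') AS arts
  FROM hivetesttable
) x
LATERAL VIEW posexplode(x.ttls) t AS pos, ttl;
```

If some elements lack an attribute, the arrays will have different lengths and the rows will be mismatched, so it is worth checking size(ttls), size(albs) and size(arts) before relying on this.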
SELECT title, album, artist, COUNT(*)
FROM results
GROUP BY title, album, artist
And there we are:
title                album                       artist          count
baby one more time   hit me baby one more time   britney spears  9
the distance         fashion nugget              cake            2
open book            fashion nugget              cake            1
daria                fashion nugget              cake            1
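To get the output sorted the way the query in the question was, the same GROUP BY can be given an alias for the count and an ORDER BY. This is a small variation on the query above; the alias counttotal is borrowed from the question, and DESC is an assumption about the desired order.

```sql
-- Count each distinct (title, album, artist) tuple, most frequent first.
SELECT title, album, artist, COUNT(*) AS counttotal
FROM results
GROUP BY title, album, artist
ORDER BY counttotal DESC;
```

The query is the same whether results is unpartitioned or partitioned by artist and/or album.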