Counting tuples in Hive?

Date: 2013-03-21 19:42:31

Tags: hive hiveql

I'm new to HiveQL and I'm a bit stuck :S

I have a table with the following schema: a single column named res, and a partition column named field that has three partitions.

create table results( res string) PARTITIONED BY (field STRING); 

Then I loaded data into this table:

insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

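I'm trying to count the distinct tuples across the three partitions.

For example, this query counts how many times each title occurs in the dataset: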

 SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

and outputs something like:

title                      count
Hit me Baby More time      9

How can I extend this to tuples (title, album, artist)? That is, I'd like output like this:

title                  album                        artist            count
Baby one more time     hit me baby one more time    britney spears    9

My whole code:

CREATE EXTERNAL TABLE IF NOT EXISTS hivetesttable  (
xmldata STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
location '/user/sdasd/hivetestdata/';

create view xmlout(line) as  select * from hivetesttable;  

CREATE VIEW TITLES(line) as select xpath(line,'/MC/SC/*/@ttl')  from xmlout;
CREATE VIEW ARTIST(line) as select  xpath(line,'/MC/SC/*/@art')  from xmlout;
CREATE VIEW ALBUMS(line) as select xpath(line,'/MC/SC/*/@alb') from xmlout;  -- alb is the album attribute in the sample XML



create table results( res string) PARTITIONED BY (field STRING); 
insert overwrite table results PARTITION (field= 'title') SELECT  explode(line) AS myNewCol FROM titles ;
insert overwrite table results PARTITION (field= 'artist') SELECT  explode(line) AS myNewCol FROM artist;
insert overwrite table results PARTITION (field= 'albums') SELECT  explode(line) AS myNewCol FROM albums;

SELECT res, count(1) AS counttotal   FROM results where field='title' GROUP BY res ORDER BY counttotal;

A single line of the XML data looks like this:

<?xml version="1.0" encoding="UTF-8"?><MC><SC><S uid="2" gen="" yr="2011" art="Samsung" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Samsung/Music" alb="Samsung" ttl="Over the horizon"/><S uid="37" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="Whatcha Say"/><S uid="38" gen="" yr="2010" art="Jason Derulo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Jason Derulo/Jason Derulo" alb="Jason Derulo" ttl="In My Head"/><S uid="39" gen="" yr="2011" art="Alexandra Stan" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Alexandra Stan/Mr_ Saxobeat - Single" alb="Mr. Saxobeat - Single" ttl="Mr. Saxobeat (Extended Version)"/><S uid="40" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Wie ein Löwe"/><S uid="41" gen="" yr="2011" art="Bushido" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Bushido/Jenseits von Gut und Böse (Premium Edition)" alb="Jenseits von Gut und Böse (Premium Edition)" ttl="Verreckt"/><S uid="42" gen="" yr="2011" art="Lucenzo" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/Music/Lucenzo/Danza Kuduro (feat_ Don Omar) [From _Fast &amp; Furious 5_] - Single" alb="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;] - Single" ttl="Danza Kuduro (feat. Don Omar) [From &quot;Fast &amp; Furious 5&quot;]"/><S uid="121" gen="" yr="701" art="Michael Jackson" cmp="&lt;unknown&gt;" fld="/mnt/sdcard/external_sd/Music/Michael Jackson/Bad [Bonus Tracks]" alb="Bad [Bonus Tracks]" ttl="Voice-Over Intro/Quincy Jones Interview #1 [*]"/></SC><PC/></MC>

1 answer:

Answer 0 (score: 1)

With the information you've provided, you can't get the output you want. Right now you have a table that looks like this:

res                           field
---                           -----
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
baby one more time            title
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
hit me baby one more time     album
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
britney spears                artist
the distance                  title
the distance                  title
open book                     title
daria                         title
fashion nugget                album
fashion nugget                album
fashion nugget                album
fashion nugget                album
cake                          artist
cake                          artist
cake                          artist
cake                          artist

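Because you partitioned it, Hive stores it in three different folders, but that doesn't affect query results. I added a few extra tracks. I imagine the output you want is (correct me if I'm wrong):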

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
the distance           fashion nugget              cake                2
open book              fashion nugget              cake                1
daria                  fashion nugget              cake                1

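But there is no way to tell that "open book" has anything to do with "fashion nugget" or "cake", just as there is no way to tell that "baby one more time" has anything to do with "britney spears". You could try to match on the counts, but you'd end up with something like: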

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
null                   fashion nugget              cake                4
the distance           null                        null                2
open book,daria        null                        null                1

I think what you want is a table with columns like this:

title                  album                         artist
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
baby one more time     hit me baby one more time     britney spears
the distance           fashion nugget                cake
the distance           fashion nugget                cake
open book              fashion nugget                cake
daria                  fashion nugget                cake

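possibly still partitioned by artist and/or album. Whether or not the table is partitioned, you can write queries as if it were not: as long as the data isn't corrupted, partitioning affects only performance, not results. It does, however, affect how you create and populate the table. Let me know if this is what you want and I'll edit this answer to address that question instead.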


Edited to add:

OK, creating the table without any partitions is very simple:

CREATE TABLE results (title string, album string, artist string);

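Creating the table with partitions is almost as easy; you just need to decide first what to partition on. If you partition by artist, you can run queries specific to one artist or a set of artists without processing the data for every other artist. If you partition by artist and album, you can do the same for albums. The cost is that large files get split into smaller ones, and MapReduce (and therefore Hive) generally handles large files better. I wouldn't worry about partitioning at all unless you're dealing with at least 10 GB and feel comfortable with how partitions work and with HiveQL. But for completeness, partitioned by artist: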

CREATE TABLE results (title string, album string) PARTITIONED BY (artist string);

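And partitioned by artist, then by album. Partitioning by (artist string, album string) versus (album string, artist string) won't change your results, but you should put the logical top of the hierarchy first.

CREATE TABLE results (title string) PARTITIONED BY (artist string, album string);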
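
To make the benefit of partitioning concrete, here is a minimal sketch of a query that filters on the partition columns. It assumes the table is named results and is partitioned by (artist, album) as in the statement above; the literal values and the counttotal alias are only illustrative.

```sql
-- Assumes a table named results, partitioned by (artist, album).
-- Filtering on partition columns lets Hive read only the matching
-- partition directories instead of scanning the whole table.
SELECT title, COUNT(*) AS counttotal
FROM results
WHERE artist = 'cake'
  AND album = 'fashion nugget'
GROUP BY title;
```

The same query returns the same rows on an unpartitioned table; only the amount of data scanned differs.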

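If the only information we have access to comes from the tables titles, artists, and albums, then populating this table isn't easy: we have a big pile of titles, artists, and albums, but no way to tell which title goes with which album, for example. I hope you have some data in which these relationships are still intact, or that your original dataset is still intact. Without knowing the form of this hypothetical data, I can't say how to populate the table. If you do go with a partitioned table, though, this answer may be useful if you don't want to specify every artist and album by hand (since each artist gets its own partition and, within it, each album gets its own partition).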
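
One way to avoid naming every partition by hand is Hive's dynamic partition insert. The sketch below is an assumption about how that could look here, not necessarily what the referenced answer describes; tracks_staging is a hypothetical table that already holds intact (title, album, artist) rows.

```sql
-- Hypothetical staging table with the relationships intact:
--   CREATE TABLE tracks_staging (title string, album string, artist string);

-- Allow Hive to create partitions from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Partition columns come last in the SELECT list, in the same order
-- as in the PARTITION clause.
INSERT OVERWRITE TABLE results PARTITION (artist, album)
SELECT title, artist, album
FROM tracks_staging;
```

With many distinct artists and albums this creates many small partitions, which is the cost described above.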

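Edit: the asker has XML files in which the title, album, and artist relationships are intact. See the comments for more on this.

Now to the heart of the question: counting the distinct tuples. This works the same way however the data is partitioned (if at all). We do it with a GROUP BY clause. When you specify one column (or a partition, which you can think of as a column with special properties), you break the data into groups, one per distinct value of that column. If you specify several columns, you break the data into groups, one per distinct combination of values of those columns. That is what we exploit to count distinct tuples: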
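
Since the XML keeps each track's attributes together, the unpartitioned results table could be filled directly from hivetesttable. This is only a sketch under several assumptions: Hive 0.13 or later (for posexplode), every track element carrying all of ttl, alb, and art, and xpath returning the attributes in document order. If a track lacks one attribute, the positions no longer line up.

```sql
-- Assumption: unpartitioned target table, as defined earlier in this answer.
CREATE TABLE IF NOT EXISTS results (title string, album string, artist string);

-- posexplode yields (position, value) pairs; keeping only the rows where
-- all three positions match re-assembles one (title, album, artist) per track.
INSERT OVERWRITE TABLE results
SELECT t.title, a.album, r.artist
FROM hivetesttable h
LATERAL VIEW posexplode(xpath(h.xmldata, '/MC/SC/*/@ttl')) t AS tpos, title
LATERAL VIEW posexplode(xpath(h.xmldata, '/MC/SC/*/@alb')) a AS apos, album
LATERAL VIEW posexplode(xpath(h.xmldata, '/MC/SC/*/@art')) r AS rpos, artist
WHERE t.tpos = a.apos
  AND t.tpos = r.rpos;
```

The three lateral views produce every combination of positions within a row before the filter is applied, so this is simple rather than efficient for rows with many tracks.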

SELECT title, album, artist, COUNT(*)
FROM results
GROUP BY title, album, artist;

And there we are:

title                  album                       artist              count
baby one more time     hit me baby one more time   britney spears      9
the distance           fashion nugget              cake                2
open book              fashion nugget              cake                1
daria                  fashion nugget              cake                1
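
The query in the question also sorted by the count. A small variant of the GROUP BY query that does the same is sketched below; the counttotal alias is borrowed from the question, and the descending order is an assumption about what is wanted.

```sql
-- Same grouping as above, with the count named and used for sorting.
SELECT title, album, artist, COUNT(*) AS counttotal
FROM results
GROUP BY title, album, artist
ORDER BY counttotal DESC;
```

ORDER BY in Hive produces a total order through a single reducer, which is fine for a result set of this size.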