Hive: combining column values based on a condition

Time: 2013-01-31 22:28:41

Tags: hive

I would like to know whether it is possible to combine column values based on a condition. Let me explain...

Suppose my data looks like this:

Id name offset
1 Jan 100
2 Janssen 104
3 Klaas 150
4 Jan 160
5 Janssen 164

My output should be this:

Id fullname offsets
1 Jan Janssen [ 100, 160 ]

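I want to combine the name values of two rows when their offsets place the names no more than 1 character apart, that is, when the second name starts one character after the first name ends.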
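To make the rule concrete: a name starting at `offset` with length `len` ends just before position `offset + len`, so the next name is one character away when it starts at `offset + len + 1`. A minimal Java sketch of that check on the sample rows above follows. The class and method names are hypothetical, and the length is assumed to be the length of the name string, since this sample has no length column.

```java
// Hypothetical helper illustrating the adjacency rule; not part of Hive.
final class AdjacentNames {

    // True when exactly one character (for example a space) separates the
    // end of the first name from the start of the second name.
    static boolean isAdjacent(String firstName, int firstOffset, int secondOffset) {
        int endOfFirst = firstOffset + firstName.length(); // position just past the first name
        return secondOffset - endOfFirst == 1;
    }

    public static void main(String[] args) {
        System.out.println(isAdjacent("Jan", 100, 104));     // true:  100 + 3 + 1 = 104
        System.out.println(isAdjacent("Janssen", 104, 150)); // false: 104 + 7 + 1 = 112
        System.out.println(isAdjacent("Klaas", 150, 160));   // false: 150 + 5 + 1 = 156
    }
}
```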

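My question is whether this kind of data manipulation is possible in Hive and, if so, whether someone could share some code and an explanation.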

Please be gentle, but this piece of code returns something close to what I want...

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Scanner;

    public class CombineNames {
        public static void main(String[] args) {
            List<String> persons = new ArrayList<String>();

            // Sample lines from entities.txt (columns: id, name, category, length, offset):
            // USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
            // USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
            File file = new File("entities.txt");

            // Read every non-empty line up front so the pairing logic below stays simple.
            List<String[]> rows = new ArrayList<String[]>();
            try (Scanner scanner = new Scanner(file)) {
                while (scanner.hasNextLine()) {
                    String line = scanner.nextLine();
                    if (!line.trim().isEmpty()) {
                        rows.add(line.split(","));
                    }
                }
            } catch (FileNotFoundException e) {
                e.printStackTrace();
            }

            int i = 0;
            while (i < rows.size()) {
                String[] previous = rows.get(i);
                if (i + 1 < rows.size()) {
                    String[] current = rows.get(i + 1);
                    // x = position just past the previous name (length + offset).
                    int x = Integer.parseInt(previous[3]) + Integer.parseInt(previous[4]);
                    int y = Integer.parseInt(current[4]);
                    // Exactly one character separates the two names: merge them.
                    if (y - x == 1) {
                        persons.add(previous[1] + " " + current[1]);
                        i += 2; // both rows are consumed by the merge
                        continue;
                    }
                }
                // No adjacent successor: keep the name on its own.
                persons.add(previous[1]);
                i++;
            }

            for (String person : persons) {
                System.out.println(person);
            }
        }
    }

Running it on this sample data:

USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Richard,PERSON,7,2732
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2740
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,2756
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3093
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,3195
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,3220
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,10660
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,10685
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,10858
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,11063
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Ken,PERSON,3,11186
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Marottoli,PERSON,9,11234
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Berkowitz,PERSON,9,17073
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Lea,PERSON,3,17095
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Stephanie,PERSON,9,17330
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4,Putt,PERSON,4,17340

produces this output:

Richard Marottoli
Marottoli
Marottoli
Marottoli
Berkowitz
Berkowitz
Marottoli
Lea
Lea
Ken
Marottoli
Berkowitz
Lea
Stephanie Putt

Kind regards

1 answer:

Answer 0 (score: 1)

Use the create table statement below to load the data into a table:

    drop table if exists default.stack;

    create external table default.stack (
        junk string,
        name string,
        cat  string,
        len  int,
        off  int
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    STORED AS
        INPUTFORMAT  'org.apache.hadoop.mapred.TextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    location 'hdfs://nameservice1/....';

Use the following query to get the desired output:

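    -- max(name) keeps the merged name when both "X" and "X Y" share an offset
    select max(name), off
    from (
        select case when b.name is not null
                    then concat(b.name, " ", a.name)  -- a starts right after b: merge
                    else a.name
               end as name,
               case when b.off1 is not null
                    then b.off1                       -- report the pair at b's offset
                    else a.off
               end as off
        from default.stack a
        left outer join (
            select name,
                   len + off + 1 as off,  -- offset where an adjacent name would start
                   off as off1            -- original offset of this name
            from default.stack
        ) b
        on a.off = b.off
    ) merged
    group by off
    order by off;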

I tested it, and it produces the result you want.
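For readers who want to see why the query works, here is a rough in-memory Java sketch of the same logic: a left outer join on `len + off + 1`, followed by `group by off` with `max(name)`. The class, the `Row` type, and the method names are made up for illustration; it assumes offsets are unique within the input, and it is not how Hive actually executes the query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory emulation of the query above; all names are illustrative.
final class JoinSketch {

    static final class Row {
        final String name;
        final int len;
        final int off;
        Row(String name, int len, int off) { this.name = name; this.len = len; this.off = off; }
    }

    // Returns offset -> name, ordered by offset, like "group by off order by off".
    static Map<Integer, String> combine(List<Row> rows) {
        // Subquery b: index every row by the offset where an adjacent name would start.
        // Assumption: offsets are unique, so one row per key is enough.
        Map<Integer, Row> byNextStart = new HashMap<>();
        for (Row b : rows) {
            byNextStart.put(b.len + b.off + 1, b);
        }

        Map<Integer, String> result = new TreeMap<>();
        for (Row a : rows) {
            Row b = byNextStart.get(a.off);            // left outer join on a.off = b.off
            String name = (b != null) ? b.name + " " + a.name : a.name;
            int off = (b != null) ? b.off : a.off;     // b.off plays the role of off1
            // max(name) per offset: "Richard Marottoli" sorts after "Richard".
            result.merge(off, name, (x, y) -> x.compareTo(y) >= 0 ? x : y);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("Richard", 7, 2732));
        rows.add(new Row("Marottoli", 9, 2740));
        rows.add(new Row("Marottoli", 9, 2756));
        combine(rows).forEach((off, name) -> System.out.println(name + "\t" + off));
    }
}
```

On these three rows the sketch prints `Richard Marottoli` at offset 2732 and `Marottoli` at offset 2756. Note that with this join, a run of three adjacent names would come out as two overlapping pairs rather than one three-part name.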