Hive transform returning a map<string,string> type using Python

Date: 2017-03-03 12:51:48

Tags: python hadoop hive

I want to transform rows from a source table into a target table without dumping the contents to disk or creating new tables, views, etc. So I started looking at streaming the contents out of the original table, modifying them on the fly, and writing them into the target table:

INSERT OVERWRITE TABLE d SELECT TRANSFORM item USING 'python po.py' AS (item map<string,string>) FROM s;

where d is defined as

CREATE TABLE d (item map<string, string>)

and s is defined as

CREATE TABLE s (item map<string, string>)

What should I print from the Python script so that the data is transformed and loaded into table d correctly?

I have tried printing different representations from the Python script, but the resulting items always end up malformed:

Something like this:

{"item":{"representation":null}}

1 Answer:

Answer 0 (score: 1)

You can have the script return a string in a specific format and cast it to a map with str_to_map. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
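
Since the question asks about Python, here is a minimal sketch of what such a script could print, assuming the map reaches the script on stdin as the JSON-style text that the sed one-liner in the demo below also expects (the value tweak is purely illustrative and mirrors what that one-liner does; if your Hive version serializes the map differently, adjust the parsing accordingly):

import sys
import json

# Read one map<string,string> row per line, modify it on the fly, and print it
# back as "k1:v1,k2:v2,..." so that str_to_map() can cast it to a map again.
for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    item = json.loads(line)                           # e.g. {"k1":"v1","k2":"v2","k3":"v3"}
    item = {k: v + "00" for k, v in item.items()}     # illustrative on-the-fly change
    print(",".join("%s:%s" % (k, v) for k, v in item.items()))

The Hive side would then look like the demo below, with using 'python po.py' (after add file po.py) in place of myscript.sh.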

Demo

bash

cat>/tmp/myscript.sh
sed -r -e 's/\{(.*)\}/\1/' -e 's/"//g' -e 's/v(.)/v\100/g'
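
The sed one-liner strips the curly braces and double quotes from the map's text representation and, purely to demonstrate an on-the-fly change, appends 00 to each value (v1 becomes v100, and so on).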

hive

create table d (item map<string,string>);
create table s (item map<string,string>);
insert into s select map('k1','v1','k2','v2','k3','v3');
add file /tmp/myscript.sh;
insert into d

select  str_to_map (result)

from   (select  transform (item) using "myscript.sh" as result
        from    s
        ) t
;
select * from d
;
+---------------------------------------+
|                d.item                 |
+---------------------------------------+
| {"k1":"v100","k2":"v200","k3":"v300"} |
+---------------------------------------+

... and for clarity:

select * from s;
+---------------------------------+
|             s.item              |
+---------------------------------+
| {"k1":"v1","k2":"v2","k3":"v3"} |
+---------------------------------+
select  result
       ,str_to_map (result)  result_map

from   (select  transform (item) using "myscript.sh" as result

        from    s
        ) t
;
+-------------------------+---------------------------------------+
|         result          |              result_map               |
+-------------------------+---------------------------------------+
| k1:v100,k2:v200,k3:v300 | {"k1":"v100","k2":"v200","k3":"v300"} |
+-------------------------+---------------------------------------+
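
Note that str_to_map uses ',' as the pair delimiter and ':' as the key/value delimiter by default, which matches the format the script emits, so no extra arguments are needed.
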
hive> explain
    > select  str_to_map (result)
    > 
    > from   (select  transform (item) using "myscript.sh" as result
    >         from    s
    >         ) t
    > ;
OK
Explain
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: s
            Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: item (type: map<string,string>)
              outputColumnNames: _col0
              Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
              Transform Operator
                command: myscript.sh
                output info:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                Select Operator
                  expressions: str_to_map(_col0) (type: map<string,string>)
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 17 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.TextInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink