Hive: applying lowercase to an array

Time: 2015-01-28 14:23:52

Tags: java arrays string hive lowercase

In Hive, how do I apply the lower() UDF to an array of strings? Or any UDF, for that matter. I can't figure out how to apply a "map" in a SELECT query.

1 Answer:

Answer 0: (score: 4)

If your use case is that you are transforming an array on its own (rather than as part of a table), then a combination of explode, lower, and collect_list should do the trick. For example (please excuse the terrible execution times; I am running on an underpowered VM):

hive> SELECT collect_list(lower(val))
    > FROM (SELECT explode(array('AN', 'EXAMPLE', 'ARRAY')) AS val) t;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 4 seconds 10 msec
Ended Job = job_1422453239049_0017
MapReduce Jobs Launched: 
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 4.01 sec   HDFS Read: 283 HDFS Write: 17 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 10 msec
OK
["an","example","array"]
Time taken: 33.05 seconds, Fetched: 1 row(s)

(Note: replace array('AN', 'EXAMPLE', 'ARRAY') in the query above with whatever expression you are using to generate your array.)
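
For instance, here is a hedged variant of the same query that assumes the array comes from Hive's built-in split function rather than a literal; any expression producing an array<string> works the same way (MapReduce output omitted):

hive> SELECT collect_list(lower(val))
    > FROM (SELECT explode(split('AN,EXAMPLE,ARRAY', ',')) AS val) t;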

If instead your use case is that your arrays are stored in a column of a Hive table and you need to apply the lowercase transformation to them, then as far as I know you have two main options:

Approach #1: Use a combination of explode and LATERAL VIEW to separate the array elements. Use lower to transform the individual elements, then collect_list to glue them back together. A quick example with some silly data:

hive> DESCRIBE foo;
OK
id                          int                                 
data                        array<string>                       
Time taken: 0.774 seconds, Fetched: 2 row(s)
hive> SELECT * FROM foo;
OK
1001        ["ONE","TWO","THREE"]
1002        ["FOUR","FIVE","SIX","SEVEN"]
Time taken: 0.434 seconds, Fetched: 2 row(s)

hive> SELECT
    >   id, collect_list(lower(exploded))
    > FROM
    >   foo LATERAL VIEW explode(data) exploded_table AS exploded
    > GROUP BY id;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 3 seconds 310 msec
Ended Job = job_1422453239049_0014
MapReduce Jobs Launched:
Job 0: Map: 1  Reduce: 1   Cumulative CPU: 3.31 sec   HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 310 msec
OK
1001        ["one","two","three"]
1002        ["four","five","six","seven"]
Time taken: 34.268 seconds, Fetched: 2 row(s)

Approach #2: Write a simple UDF to apply the transformation. Something like:

package my.package_name;

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Simple (old-style) Hive UDF that lowercases every element of a string array.
public class LowerArray extends UDF {
  public List<Text> evaluate(List<Text> input) {
    List<Text> output = new ArrayList<Text>();
    for (Text element : input) {
      // Lowercase each element and copy it into the output list.
      output.add(new Text(element.toString().toLowerCase()));
    }
    return output;
  }
}

Then invoke the UDF directly on the data:

hive> ADD JAR my_jar.jar;
Added my_jar.jar to class path
Added resource: my_jar.jar
hive> CREATE TEMPORARY FUNCTION lower_array AS 'my.package_name.LowerArray';
OK
Time taken: 2.803 seconds
hive> SELECT id, lower_array(data) FROM foo;
...
... Lots of MapReduce spam
...
MapReduce Total cumulative CPU time: 2 seconds 760 msec
Ended Job = job_1422453239049_0015
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.76 sec   HDFS Read: 358 HDFS Write: 44 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK  
1001        ["one","two","three"]
1002        ["four","five","six","seven"]
Time taken: 27.243 seconds, Fetched: 2 row(s)

There are some trade-offs between the two approaches. #2 will usually be more efficient at run time than #1, because the GROUP BY clause in #1 forces a reduce stage while the UDF approach does not. On the other hand, #1 does everything in HiveQL and is easier to generalize: if you need to, you can substitute some other kind of string transformation for lower in the query. With the UDF approach of #2, you would likely have to write a new UDF for each different kind of transformation you want to apply.
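
To illustrate that generalization point, here is a minimal sketch of the same #1 query against the foo table above, simply swapping in the built-in upper function in place of lower (MapReduce output omitted); it would return the arrays uppercased instead:

hive> SELECT
    >   id, collect_list(upper(exploded))
    > FROM
    >   foo LATERAL VIEW explode(data) exploded_table AS exploded
    > GROUP BY id;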