Extracting a single column from a HIVE table

Time: 2015-07-16 21:32:06

Tags: hadoop hive hdfs

Here is a select query against the HIVE table:

select *  from test_aviation limit 5;                                                   
OK
2015    1   1   1   4   2015-01-01  AA  19805   AA  N787AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  06  California  91  0900    0855    -5.00   0.00    0.00    -1  0900-0959   17.00   0912    1230    7.00    1230    1237    7.00    7.00    0.00    0   1200-1259   0.00        0.00    390.00  402.00  378.00  1.00    2475.00 10          
2015    1   1   2   5   2015-01-02  AA  19805   AA  N795AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  06  California  91  0900    0850    -10.00  0.00    0.00    -1  0900-0959   15.00   0905    1202    9.00    1230    1211    -19.00  0.00    0.00    -2  1200-1259   0.00        0.00    390.00  381.00  357.00  1.00    2475.00 10          
2015    1   1   3   6   2015-01-03  AA  19805   AA  N788AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  06  California  91  0900    0853    -7.00   0.00    0.00    -1  0900-0959   15.00   0908    1138    13.00   1230    1151    -39.00  0.00    0.00    -2  1200-1259   0.00        0.00    390.00  358.00  330.00  1.00    2475.00 10          
2015    1   1   4   7   2015-01-04  AA  19805   AA  N791AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  06  California  91  0900    0853    -7.00   0.00    0.00    -1  0900-0959   14.00   0907    1159    19.00   1230    1218    -12.00  0.00    0.00    -1  1200-1259   0.00        0.00    390.00  385.00  352.00  1.00    2475.00 10          
2015    1   1   5   1   2015-01-05  AA  19805   AA  N783AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  06  California  91  0900    0853    -7.00   0.00    0.00    -1  0900-0959   27.00   0920    1158    24.00   1230    1222    -8.00   0.00    0.00    -1  1200-1259   0.00        0.00    390.00  389.00  338.00  1.00    2475.00 10          
Time taken: 0.067 seconds, Fetched: 5 row(s)

Structure of the HIVE table:

hive> describe test_aviation;
OK
col_value               string                                      
Time taken: 0.221 seconds, Fetched: 1 row(s)

I want to split the whole table into separate columns. I wrote the following query to extract the 12th column:

SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 12)  from test_aviation;

Output:

Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1437067221195_0008, Tracking URL = http://localhost:8088/proxy/application_1437067221195_0008/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1437067221195_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-07-17 02:46:56,215 Stage-1 map = 0%,  reduce = 0%
2015-07-17 02:47:27,650 Stage-1 map = 100%,  reduce = 0%
Ended Job = job_1437067221195_0008 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://localhost:8088/proxy/application_1437067221195_0008/
Examining task ID: task_1437067221195_0008_m_000000 (and more) from job job_1437067221195_0008

Task with the most failures(4): 
-----
Task ID:
  task_1437067221195_0008_m_000000

URL:
  http://localhost:8088/taskdetails.jsp?jobid=job_1437067221195_0008&tipid=task_1437067221195_0008_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:195)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
    at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
    ... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer)  on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract@4def4616 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {2015  1   1   1   4   2015-01-01  AA  19805   AA  N787AA  1   JFK New York     NY NY  36  New York    22  LAX Los Angeles  CA CA  06  California  91  0900    0855    -5.00   0.00    0.00    -1  0900-0959   17.00   0912    1230    7.00    1230    1237    7.00    7.00    0.00    0   1200-1259   0.00        0.00    390.00  402.00  378.00  1.00    2475.00 10          :java.lang.String, ^(?:([^,]*),?){1}:java.lang.String, 12:java.lang.Integer} of size 3
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1243)
    at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)
    at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:166)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
    at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
    at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:79)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
    at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
    at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
    at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
    ... 9 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1219)
    ... 18 more
Caused by: java.lang.IndexOutOfBoundsException: No group 12
    at java.util.regex.Matcher.group(Matcher.java:487)
    at org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(UDFRegExpExtract.java:56)
    ... 23 more

Please help me extract the individual columns from the HIVE table.

1 answer:

Answer 0 (score: 0)

Try this:

select split(col_value,' ')[11] as column_12 from test_aviation;

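This assumes the fields are separated by a space. For other delimiters, change the second argument of split:

'\\t' for a tab
'\\|' for a pipe
':' for a colon

and so on.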
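
The row printed in the error message shows \t between the fields, so the data here looks tab-delimited and split(col_value, '\\t')[11] is probably the variant to use. The sketch below checks that logic outside Hive, in plain Python, as a stand-in only: the row literal is the first 13 fields of the row from the error message, the variable names are made up for illustration, and Python's re module is assumed to behave like Java's regex engine for these simple patterns. It also shows why the original query failed: the pattern defines a single capturing group, so group 12 does not exist.

```python
import re

# First 13 tab-separated fields of the row quoted in the error message;
# the remaining fields are left out to keep the example short.
row = "2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York"

# 1) Split on the actual delimiter (tab) and take the 12th field,
#    which is index 11 because indexing starts at 0.
column_12 = row.split("\t")[11]
print(column_12)

# 2) The pattern from the question, as it appears in the error message,
#    contains only one capturing group, so requesting group 12 must fail.
question_pattern = re.compile(r"^(?:([^,]*),?){1}")
print(question_pattern.groups)

# 3) A regex alternative: skip 11 tab-terminated fields, capture the next one.
#    The group index is 1 because this pattern also has a single capturing group.
tab_pattern = re.compile(r"^(?:[^\t]*\t){11}([^\t]*)")
match = tab_pattern.match(row)
column_12_regex = match.group(1) if match else None
print(column_12_regex)
```

Both approaches return JFK for this row. If a regex is preferred in Hive itself, the third argument of regexp_extract would have to be 1 rather than 12, with the field position expressed in the pattern as above; treat the exact escaping of \t inside a Hive string literal as something to verify on your own installation.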