Hive DML transactions (UPDATE/DELETE) do not work with subqueries

Date: 2017-03-16 21:12:41

Tags: hadoop merge hive dml

I know that Hive/Hadoop is not really meant for UPDATE/DELETE, but my requirement is to update table person20 based on the data in table person21. With Hive on ORC it now supports ACID, but the support still looks immature.

$ hive --version 

Hive 1.1.0-cdh5.6.0

Below are the detailed steps I executed to test the update logic.

CREATE TABLE person20(
  persid int,
  lastname string,
  firstname string)
CLUSTERED BY (
  persid)
INTO 1 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hostname.com:8020/user/hive/warehouse/person20'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='3',
  'numRows'='2',
  'rawDataSize'='348',
  'totalSize'='1730',
  'transactional'='true',
  'transient_lastDdlTime'='1489668385')

Insert statement:

INSERT INTO TABLE person20 VALUES (0,'PP','B'),(2,'X','Y');

Select statement:

set hive.cli.print.header=true;

select * from person20;

persid lastname  firstname
2       X       Y
0       PP      B

I have another table, person21, which is a replica of person20:

CREATE TABLE person21(
  persid int,
  lastname string,
  firstname string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hostname.com:8020/user/hive/warehouse/person21'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='2',
  'rawDataSize'='11',
  'totalSize'='13',
  'transient_lastDdlTime'='1489668344')

Insert statement:

INSERT INTO TABLE person21 VALUES (0,'SS','B'),(2,'X1','Y');

Select statement:

select * from person21;

persid lastname firstname
2       X1       Y
0       SS       B

The MERGE logic I want to implement:

MERGE INTO person20 p20 USING person21 p21
ON (p20.persid = p21.persid)
WHEN MATCHED THEN
UPDATE SET p20.lastname = p21.lastname
  • But MERGE does not work on my Hive 1.1.0-cdh5.6.0; it only becomes available starting with Hive 2.2.
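For reference, once on Hive 2.2+ with a transactional target table, the full MERGE would look roughly like this. This is a hedged sketch of the standard syntax; the WHEN NOT MATCHED branch is my addition for completeness and was not part of the original question:

```sql
-- Requires Hive 2.2+, an ACID target table (transactional ORC, bucketed),
-- and hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager.
MERGE INTO person20 p20
USING person21 p21
ON p20.persid = p21.persid
WHEN MATCHED THEN
  UPDATE SET lastname = p21.lastname
WHEN NOT MATCHED THEN
  INSERT VALUES (p21.persid, p21.lastname, p21.firstname);
```

Note that in Hive's MERGE the UPDATE SET clause names the target columns without a table alias prefix.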

The other option is an UPDATE with a correlated subquery:

hive -e "set hive.auto.convert.join.noconditionaltask.size = 10000000; set hive.support.concurrency = true; set hive.enforce.bucketing = true; set hive.exec.dynamic.partition.mode = nonstrict; set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1 ; UPDATE person20 SET lastname = (select lastname from person21 where person21.lastname=person20.lastname);" 
  • This statement gives the following error:
  

Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.6.0.jar!/hive-log4j.properties
NoViableAltException(224@[400:1: precedenceEqualExpression : ... ])
        at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
        at org.antlr.runtime.DFA.predict(DFA.java:116)
        at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8651)
        at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9673)
        at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceAndExpression(HiveParser_IdentifiersParser.java:9792)
        at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceOrExpression(HiveParser_IdentifiersParser.java:9951)
        at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.expression(HiveParser_IdentifiersParser.java:6567)
        ...
        at org.apache.hadoop.hive.ql.parse.HiveParser.precedencePlusExpression(HiveParser.java:44550)
        at org.apache.hadoop.hive.ql.parse.HiveParser.columnAssignmentClause(HiveParser.java:44206)
        at org.apache.hadoop.hive.ql.parse.HiveParser.setColumnsClause(HiveParser.java:44271)
        at org.apache.hadoop.hive.ql.parse.HiveParser.updateStatement(HiveParser.java:44417)
        at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1616)
        at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1062)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:201)
        at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
        at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
        ...
        at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:33 cannot recognize input near 'select' 'lastname' 'from' in expression specification

It appears that subqueries are not supported here. The same statement works with a constant:

hive -e "set hive.auto.convert.join.noconditionaltask.size = 10000000; set hive.support.concurrency = true; set hive.enforce.bucketing = true; set hive.exec.dynamic.partition.mode = nonstrict; set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1 ; UPDATE person20 SET lastname = 'PP' WHERE  persid = 0;"

- This statement updates the record successfully.

Can you help me find the best strategy to perform DML/MERGE operations in Hive?

1 Answer:

Answer 0 (score: 1)

You can do it by brute force:

  • Re-create table person20, but non-ACID, partitioned on a dummy column, with a single partition 'dummy'
  • Populate person20 and person21
  • Create a work table tmpperson20 with exactly the same structure as person20, including the same 'dummy' partition
  • INSERT INTO tmpperson20 PARTITION (dummy='dummy') SELECT p20.persid, p21.lastname, ... FROM person20 p20 JOIN person21 p21 ON p20.persid=p21.persid
  • INSERT INTO tmpperson20 PARTITION (dummy='dummy') SELECT * FROM person20 p20 WHERE NOT EXISTS (select p21.persid FROM person21 p21 WHERE p20.persid=p21.persid)
  • ALTER TABLE person20 DROP PARTITION (dummy='dummy')
  • ALTER TABLE person20 EXCHANGE PARTITION (dummy='dummy') WITH TABLE tmpperson20
  • Now you can drop tmpperson20
It might be trickier with an ACID table, though, because of the differences.
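Put together, the steps above can be sketched as follows. Assumptions: the rebuilt table is called person20_new here (my name, to keep it distinct from the original person20), and it is stored as plain, non-transactional ORC:

```sql
-- Non-ACID replacement for person20, partitioned on a dummy column
CREATE TABLE person20_new (
  persid int,
  lastname string,
  firstname string)
PARTITIONED BY (dummy string)
STORED AS ORC;

-- Work table with the identical layout
CREATE TABLE tmpperson20 LIKE person20_new;

-- Matching rows: take the updated lastname from person21
INSERT INTO tmpperson20 PARTITION (dummy='dummy')
SELECT p20.persid, p21.lastname, p20.firstname
FROM person20_new p20 JOIN person21 p21 ON p20.persid = p21.persid;

-- Non-matching rows: carry them over unchanged
INSERT INTO tmpperson20 PARTITION (dummy='dummy')
SELECT p20.persid, p20.lastname, p20.firstname
FROM person20_new p20
WHERE NOT EXISTS (SELECT 1 FROM person21 p21 WHERE p20.persid = p21.persid);

-- Swap the rebuilt partition into place, then clean up
ALTER TABLE person20_new DROP PARTITION (dummy='dummy');
ALTER TABLE person20_new EXCHANGE PARTITION (dummy='dummy') WITH TABLE tmpperson20;
DROP TABLE tmpperson20;
```

EXCHANGE PARTITION requires the partition to exist in the source table and not in the target, which is why the target's partition is dropped first.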

You could also try a procedural language with a cursor iterating over person21, applying individual UPDATEs in a loop. That is very inefficient for massive updates, though...

The HPL/SQL utility ships with Hive 2.x and could probably be installed on top of Hive 1.x, but I have not had a chance to try it. The Oracle dialect does feel strange on Hive, though...!
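A minimal sketch of that cursor loop in HPL/SQL's FOR-cursor form (untested on Hive 1.x, as noted above; it relies on single-row UPDATEs against the ACID table working, which the constant-value test earlier showed they do):

```sql
-- HPL/SQL-style sketch: loop over person21 and issue one UPDATE per row.
FOR rec IN (SELECT persid, lastname FROM person21)
LOOP
  UPDATE person20
  SET lastname = rec.lastname
  WHERE persid = rec.persid;
END LOOP;
```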

Or you could develop some custom Java code that uses JDBC, reading person21 through a ResultSet and executing a PreparedStatement UPDATE in a loop.