I know Hive/Hadoop is not meant for updates/deletes, but my requirement is to update table person20 based on the data in table person21. With the progress Hive has made with ORC it now supports ACID, but it still looks immature.
$ hive --version
Below are the detailed steps I executed to test the update logic.
CREATE TABLE person20(
  persid int,
  lastname string,
  firstname string)
CLUSTERED BY (
  persid)
INTO 1 BUCKETS
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
  'hdfs://hostname.com:8020/user/hive/warehouse/person20'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='3',
  'numRows'='2',
  'rawDataSize'='348',
  'totalSize'='1730',
  'transactional'='true',
  'transient_lastDdlTime'='1489668385')
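For context, DML against this ACID table only works when the transaction settings are enabled per session; these mirror what I pass to hive -e in the attempts further below:

set hive.support.concurrency=true;
set hive.enforce.bucketing=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=1;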
Insert statement:
INSERT INTO TABLE person20 VALUES (0,'PP','B'),(2,'X','Y');
Select statement:
set hive.cli.print.header=true;
select * from person20;
persid lastname firstname
2 X Y
0 PP B
I have another table that is a replica of person20, namely person21:
CREATE TABLE person21(
  persid int,
  lastname string,
  firstname string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'hdfs://hostname.com:8020/user/hive/warehouse/person21'
TBLPROPERTIES (
  'COLUMN_STATS_ACCURATE'='true',
  'numFiles'='1',
  'numRows'='2',
  'rawDataSize'='11',
  'totalSize'='13',
  'transient_lastDdlTime'='1489668344')
Insert statement:
INSERT INTO TABLE person21 VALUES (0,'SS','B'),(2,'X1','Y');
Select statement:
select * from person21;
persid lastname firstname
2 X1 Y
0 SS B
I want to implement MERGE logic:
MERGE INTO person20 p20 USING person21 p21
ON (p20.persid = p21.persid)
WHEN MATCHED THEN
UPDATE SET p20.lastname = p21.lastname;
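As far as I can tell, MERGE only exists from Hive 2.2 onwards (and only on ACID tables), so my Hive 1.1 / CDH 5.6 build cannot run it. For reference, a sketch of the form Hive 2.2+ accepts (the WHEN NOT MATCHED branch is my illustrative addition; note that target columns in SET are written unqualified):

-- Hive 2.2+ only, target table must be ACID
MERGE INTO person20 AS p20
USING person21 AS p21
ON p20.persid = p21.persid
WHEN MATCHED THEN UPDATE SET lastname = p21.lastname
WHEN NOT MATCHED THEN INSERT VALUES (p21.persid, p21.lastname, p21.firstname);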
The other option is a correlated-subquery update:
hive -e "set hive.auto.convert.join.noconditionaltask.size = 10000000; set hive.support.concurrency = true; set hive.enforce.bucketing = true; set hive.exec.dynamic.partition.mode = nonstrict; set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1 ; UPDATE person20 SET lastname = (select lastname from person21 where person21.lastname=person20.lastname);"
Logging initialized using configuration in jar:file:/usr/lib/hive/lib/hive-common-1.1.0-cdh5.6.0.jar!/hive-log4j.properties
NoViableAltException(224@[400:1: precedenceEqualExpression : ( ( left= precedenceBitwiseOrExpression -> $left) ( ( KW_NOT precedenceEqualNegatableOperator notExpr= precedenceBitwiseOrExpression ) -> ^( KW_NOT ^( precedenceEqualNegatableOperator $precedenceEqualExpression $notExpr) ) | ( precedenceEqualOperator equalExpr= precedenceBitwiseOrExpression ) -> ^( precedenceEqualOperator $precedenceEqualExpression $equalExpr) | ( KW_NOT KW_IN LPAREN KW_SELECT )=> ( KW_NOT KW_IN subQueryExpression ) -> ^( KW_NOT ^( TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_IN ) subQueryExpression $precedenceEqualExpression) ) | ( KW_NOT KW_IN expressions ) -> ^( KW_NOT ^( TOK_FUNCTION KW_IN $precedenceEqualExpression expressions ) ) | ( KW_IN LPAREN KW_SELECT )=> ( KW_IN subQueryExpression ) -> ^( TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_IN ) subQueryExpression $precedenceEqualExpression) | ( KW_IN expressions ) -> ^( TOK_FUNCTION KW_IN $precedenceEqualExpression expressions ) | ( KW_NOT KW_BETWEEN ( min= precedenceBitwiseOrExpression ) KW_AND ( max= precedenceBitwiseOrExpression ) ) -> ^( TOK_FUNCTION Identifier["between"] KW_TRUE $left $min $max) | ( KW_BETWEEN ( min= precedenceBitwiseOrExpression ) KW_AND ( max= precedenceBitwiseOrExpression ) ) -> ^( TOK_FUNCTION Identifier["between"] KW_FALSE $left $min $max) )* | ( KW_EXISTS LPAREN KW_SELECT )=> ( KW_EXISTS subQueryExpression ) -> ^( TOK_SUBQUERY_EXPR ^( TOK_SUBQUERY_OP KW_EXISTS ) subQueryExpression ) );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:116)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceEqualExpression(HiveParser_IdentifiersParser.java:8651)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceNotExpression(HiveParser_IdentifiersParser.java:9673)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceAndExpression(HiveParser_IdentifiersParser.java:9792)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceOrExpression(HiveParser_IdentifiersParser.java:9951)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.expression(HiveParser_IdentifiersParser.java:6567)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.atomExpression(HiveParser_IdentifiersParser.java:6791)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceFieldExpression(HiveParser_IdentifiersParser.java:6862)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnaryPrefixExpression(HiveParser_IdentifiersParser.java:7247)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceUnarySuffixExpression(HiveParser_IdentifiersParser.java:7307)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceBitwiseXorExpression(HiveParser_IdentifiersParser.java:7491)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedenceStarExpression(HiveParser_IdentifiersParser.java:7651)
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.precedencePlusExpression(HiveParser_IdentifiersParser.java:7811)
at org.apache.hadoop.hive.ql.parse.HiveParser.precedencePlusExpression(HiveParser.java:44550)
at org.apache.hadoop.hive.ql.parse.HiveParser.columnAssignmentClause(HiveParser.java:44206)
at org.apache.hadoop.hive.ql.parse.HiveParser.setColumnsClause(HiveParser.java:44271)
at org.apache.hadoop.hive.ql.parse.HiveParser.updateStatement(HiveParser.java:44417)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1616)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1062)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:201)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:404)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:305)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1119)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1167)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1055)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1045)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:207)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:159)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:370)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:305)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:702)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:675)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:615)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
FAILED: ParseException line 1:33 cannot recognize input near 'select' 'lastname' 'from' in expression specification
I assume subqueries are not supported in the SET clause. The same statement works with a constant:
hive -e "set hive.auto.convert.join.noconditionaltask.size = 10000000; set hive.support.concurrency = true; set hive.enforce.bucketing = true; set hive.exec.dynamic.partition.mode = nonstrict; set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager; set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads = 1 ; UPDATE person20 SET lastname = 'PP' WHERE persid = 0;"
This statement updates the record successfully.
Can you help me find the best strategy for performing DML/merge operations in Hive?
Answer 0 (score: 1)
You could do it the brute-force way:
- re-create person20, but not ACID: partitioned on a dummy column name, and with a single 'dummy' partition
- JOIN person20 and person21 into a twin table tmpperson20, created with the same 'dummy' partition as person20:
INSERT INTO tmpperson20 PARTITION (dummy='dummy') SELECT p20.persid, p21.lastname, ... FROM person20 p20 JOIN person21 p21 ON p20.persid=p21.persid
INSERT INTO tmpperson20 PARTITION (dummy='dummy') SELECT * FROM person20 p20 WHERE NOT EXISTS (select p21.persid FROM person21 p21 WHERE p20.persid=p21.persid)
ALTER TABLE person20 DROP PARTITION (dummy='dummy')
ALTER TABLE person20 EXCHANGE PARTITION (dummy='dummy') WITH TABLE tmpperson20
Finally, drop the now-empty tmpperson20.
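A minimal sketch of the DDL this approach assumes (columns copied from the question; ORC is just an arbitrary choice here, any non-ACID format works):

-- plain (non-ACID) table, partitioned on a dummy column
CREATE TABLE person20 (
  persid int,
  lastname string,
  firstname string)
PARTITIONED BY (dummy string)
STORED AS ORC;

-- twin staging table with the same schema and partitioning
CREATE TABLE tmpperson20 LIKE person20;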
The HPL/SQL utility ships with Hive 2.x and might be installable on top of Hive 1.x, but I never had a chance to try it. The Oracle dialect feels really weird on top of Hive, though...!
Or you could develop some custom Java code that uses a JDBC ResultSet and PreparedStatement in a loop.
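For illustration, a rough sketch of that loop, assuming a HiveServer2 endpoint and the Hive JDBC driver on the classpath (URL, credentials and class name are hypothetical; only the table and column names come from the question, and the ACID session settings mirror the hive -e calls above):

import java.sql.*;
import java.util.LinkedHashMap;
import java.util.Map;

public class Person20Merger {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:hive2://hostname.com:10000/default"; // hypothetical endpoint
        try (Connection conn = DriverManager.getConnection(url, "hive", "")) {
            Map<Integer, String> source = new LinkedHashMap<>();
            try (Statement stmt = conn.createStatement()) {
                // enable ACID DML for this session
                stmt.execute("SET hive.support.concurrency=true");
                stmt.execute("SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager");
                // read all source rows first so the update loop runs on its own
                try (ResultSet rs = stmt.executeQuery("SELECT persid, lastname FROM person21")) {
                    while (rs.next()) {
                        source.put(rs.getInt("persid"), rs.getString("lastname"));
                    }
                }
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE person20 SET lastname = ? WHERE persid = ?")) {
                for (Map.Entry<Integer, String> row : source.entrySet()) {
                    ps.setString(1, row.getValue());
                    ps.setInt(2, row.getKey());
                    ps.executeUpdate(); // one ACID UPDATE per matched row: simple but slow
                }
            }
        }
    }
}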