我正在使用Hive(版本0.11.0)并尝试连接两个表。 一个有26,286,629个记录,另一个有931个记录。
这是我试图运行的查询。
SELECT *
FROM table_1 hapmap a tabl_2 b
WHERE a.chrom = b.chrom
AND a.start_pos >= b.start_pos
AND a.end_pos <= b.end_pos
LIMIT 10
;
在前几分钟看起来很好,但如果map和reduce任务都达到了100%,那么它会从0%开始再次降低。
2014-02-20 20:38:35,531 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:36,533 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:37,536 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:38,539 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 399.23 sec
2014-02-20 20:38:39,541 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:40,544 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:41,547 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:42,550 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 192.49 sec
2014-02-20 20:38:43,554 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 192.49 sec
以下是map和reduce任务的最后4KB日志。
映射任务日志(最后4KB)
.hadoop.mapred.MapTask: bufstart = 99180857; bufend = 16035989; bufvoid = 99614720
2014-02-20 19:57:03,008 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196599; kvend = 131062; length = 327680
2014-02-20 19:57:03,180 INFO org.apache.hadoop.mapred.MapTask: Finished spill 12
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: bufstart = 16035989; bufend = 32544041; bufvoid = 99614720
2014-02-20 19:57:04,244 INFO org.apache.hadoop.mapred.MapTask: kvstart = 131062; kvend = 65525; length = 327680
2014-02-20 19:57:04,399 INFO org.apache.hadoop.mapred.MapTask: Finished spill 13
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: bufstart = 32544041; bufend = 48952648; bufvoid = 99614720
2014-02-20 19:57:05,440 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65525; kvend = 327669; length = 327680
2014-02-20 19:57:05,598 INFO org.apache.hadoop.mapred.MapTask: Finished spill 14
2014-02-20 19:57:05,767 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 forwarding 4000000 rows
2014-02-20 19:57:05,767 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarding 4000000 rows
2014-02-20 19:57:05,767 INFO ExecMapper: ExecMapper: processing 4000000 rows: used memory = 123701072
2014-02-20 19:57:06,562 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 finished. closing...
2014-02-20 19:57:06,574 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 forwarded 4182243 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 finished. closing...
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 finished. closing...
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 3 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 2 Close done
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.MapOperator: DESERIALIZE_ERRORS:0
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 finished. closing...
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 forwarded 4182243 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 finished. closing...
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.ReduceSinkOperator: 1 forwarded 0 rows
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.TableScanOperator: 0 Close done
2014-02-20 19:57:06,575 INFO org.apache.hadoop.hive.ql.exec.MapOperator: 9 Close done
2014-02-20 19:57:06,575 INFO ExecMapper: ExecMapper: processed 4182243 rows: used memory = 128772720
2014-02-20 19:57:06,577 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
2014-02-20 19:57:06,713 INFO org.apache.hadoop.mapred.MapTask: Finished spill 15
2014-02-20 19:57:06,720 INFO org.apache.hadoop.mapred.Merger: Merging 16 sorted segments
2014-02-20 19:57:06,730 INFO org.apache.hadoop.mapred.Merger: Merging 7 intermediate segments out of a total of 16
2014-02-20 19:57:08,308 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 10 segments left of total size: 272242546 bytes
2014-02-20 19:57:11,762 INFO org.apache.hadoop.mapred.Task: Task:attempt_201402201604_0005_m_000000_0 is done. And is in the process of commiting
2014-02-20 19:57:11,834 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201402201604_0005_m_000000_0' done.
2014-02-20 19:57:11,837 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2014-02-20 19:57:11,868 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
2014-02-20 19:57:11,869 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName root for UID 0 from the native implementation
减少任务日志(最后4KB)
x.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
at org.apache.hadoop.ipc.Client.call(Client.java:1113)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)
2014-02-20 20:11:32,743 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for null bad datanode[0] nodes == null
2014-02-20 20:11:32,743 WARN org.apache.hadoop.hdfs.DFSClient: Could not get block locations. Source file "/tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0" - Aborting...
2014-02-20 20:11:32,744 ERROR org.apache.hadoop.hdfs.DFSClient: Failed to close file /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0
org.apache.hadoop.ipc.RemoteException:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1999)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1990)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1899)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
at org.apache.hadoop.ipc.Client.call(Client.java:1113)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at com.sun.proxy.$Proxy2.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3720)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:3580)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2600(DFSClient.java:2783)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3023)
Namenode登录(2014-02-20 20:11:32)
2014-02-20 20:01:28,769 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 93 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 46 SyncTimes(ms): 29
2014-02-20 20:11:32,984 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 94 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 47 SyncTimes(ms): 30
2014-02-20 20:11:32,987 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:root cause:org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
2014-02-20 20:11:32,987 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9000, call addBlock(/tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0, DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1, null) from 172.27.250.92:55640: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /tmp/hive-root/hive_2014-02-20_19-56-40_541_484791785779427461/_task_tmp.-ext-10001/_tmp.000000_0 File does not exist. Holder DFSClient_attempt_201402201604_0005_r_000000_0_636140892_1 does not have any open files
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1999)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1990)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1899)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:783)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:587)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1432)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1428)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1426)
2014-02-20 20:14:48,995 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 96 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 48 SyncTimes(ms): 31
2014-02-20 20:18:28,063 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 172.27.114.218
2014-02-20 20:18:28,064 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 96 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 49 SyncTimes(ms): 32
任何人都可以帮助我?
提前致谢。
答案 0 :(得分:4)
Hive 11无法从WHERE子句推断出JOIN条件。您编写的查询将作为交叉产品连接执行,然后根据where条件进行过滤。这是非常昂贵的,你应该使用它:
SELECT *
FROM table_1 hapmap a JOIN tabl_2 b ON a.chrom = b.chrom
WHERE
a.start_pos >= b.start_pos
AND a.end_pos <= b.end_pos
LIMIT 10;
根据你所说的,这应该作为mapjoin执行,并且会更快。