I am a beginner with Hadoop and HBase, and I am learning how to use 'importtsv' to import a large dataset (an 8 GB TSV file stored in HDFS) into HBase. However, the MapReduce job seems very slow, and after a long time it fails. Perhaps the file is too large and it overwhelms the cluster. When I switch to a small TSV file, it runs fine. So if I insist on importing such a large file, how can I speed up the MapReduce job in my case? Is there any caching configuration in Hadoop that could help with this?
I have one macOS namenode and two Ubuntu datanodes.
Import command:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,info:action,timestamp,info:name,info:bank,info:account records /user/root
Error message:
2017-03-20 16:48:27,136 INFO [main] zookeeper.ZooKeeper: Client environment:java.library.path=/usr/local/hadoop/lib/native
2017-03-20 16:48:27,136 INFO [main] zookeeper.ZooKeeper: Client environment:java.io.tmpdir=/var/folders/nl/f3lktfgn7jg46jycx21cxfmr0000gn/T/
2017-03-20 16:48:27,136 INFO [main] zookeeper.ZooKeeper: Client environment:java.compiler=<NA>
2017-03-20 16:48:27,137 INFO [main] zookeeper.ZooKeeper: Client environment:os.name=Mac OS X
2017-03-20 16:48:27,137 INFO [main] zookeeper.ZooKeeper: Client environment:os.arch=x86_64
2017-03-20 16:48:27,137 INFO [main] zookeeper.ZooKeeper: Client environment:os.version=10.12.3
2017-03-20 16:48:27,137 INFO [main] zookeeper.ZooKeeper: Client environment:user.name=haohui
2017-03-20 16:48:27,137 INFO [main] zookeeper.ZooKeeper: Client environment:user.home=/Users/haohui
2017-03-20 16:48:27,138 INFO [main] zookeeper.ZooKeeper: Client environment:user.dir=/Users/haohui
2017-03-20 16:48:27,138 INFO [main] zookeeper.ZooKeeper: Initiating client connection, connectString=master:2181,node1:2181,node2:2181 sessionTimeout=30000 watcher=hconnection-0x3fc2959f0x0, quorum=master:2181,node1:2181,node2:2181, baseZNode=/hbase
2017-03-20 16:48:27,157 INFO [main-SendThread(master:2181)] zookeeper.ClientCnxn: Opening socket connection to server master/10.211.55.2:2181. Will not attempt to authenticate using SASL (unknown error)
2017-03-20 16:48:27,188 INFO [main-SendThread(master:2181)] zookeeper.ClientCnxn: Socket connection established to master/10.211.55.2:2181, initiating session
2017-03-20 16:48:27,200 INFO [main-SendThread(master:2181)] zookeeper.ClientCnxn: Session establishment complete on server master/10.211.55.2:2181, sessionid = 0x15aeae6867a0001, negotiated timeout = 30000
2017-03-20 16:48:56,396 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-03-20 16:48:56,441 INFO [main] client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x15aeae6867a0001
2017-03-20 16:48:56,450 INFO [main] zookeeper.ZooKeeper: Session: 0x15aeae6867a0001 closed
2017-03-20 16:48:56,450 INFO [main-EventThread] zookeeper.ClientCnxn: EventThread shut down
2017-03-20 16:48:56,524 INFO [main] client.RMProxy: Connecting to ResourceManager at master/10.211.55.2:8032
2017-03-20 16:48:56,666 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-03-20 16:48:58,818 INFO [main] input.FileInputFormat: Total input paths to process : 1
2017-03-20 16:48:58,873 INFO [main] mapreduce.JobSubmitter: number of splits:56
2017-03-20 16:48:58,884 INFO [main] Configuration.deprecation: io.bytes.per.checksum is deprecated. Instead, use dfs.bytes-per-checksum
2017-03-20 16:48:59,006 INFO [main] mapreduce.JobSubmitter: Submitting tokens for job: job_1489999688045_0001
2017-03-20 16:48:59,319 INFO [main] impl.YarnClientImpl: Submitted application application_1489999688045_0001
2017-03-20 16:48:59,370 INFO [main] mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1489999688045_0001/
2017-03-20 16:48:59,371 INFO [main] mapreduce.Job: Running job: job_1489999688045_0001
2017-03-20 16:49:09,668 INFO [main] mapreduce.Job: Job job_1489999688045_0001 running in uber mode : false
2017-03-20 16:49:09,670 INFO [main] mapreduce.Job: map 0% reduce 0%
2017-03-20 17:00:09,103 INFO [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000009_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000009_0 Timed out after 600 secs
2017-03-20 17:00:09,127 INFO [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000011_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000011_0 Timed out after 600 secs
2017-03-20 17:00:09,128 INFO [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000010_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000010_0 Timed out after 600 secs
2017-03-20 17:00:09,129 INFO [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000013_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000013_0 Timed out after 600 secs
2017-03-20 17:00:09,130 INFO [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000008_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000008_0 Timed out after 600 secs
2017-03-20 17:00:09,131 INFO [main] mapreduce.Job: Task Id : attempt_1489999688045_0001_m_000012_0, Status : FAILED
AttemptID:attempt_1489999688045_0001_m_000012_0 Timed out after 600 secs
Answer 0 (score: 0)
I'm not sure how to speed up the operation, since that really depends on your schema and your data. You can find some information on optimal row key design in this article. As for the crash, your job is probably failing with a timeout: long-running steps in the bulk-loading phase of the MapReduce job scheduled by the ImportTsv utility do not report progress back to YARN. You can increase the timeout in the mapred-site.xml file:
<property>
  <name>mapred.task.timeout</name>
  <value>2000000</value> <!-- 2000000 ms = 2000 seconds -->
</property>
Alternatively, you can set it to 0, which disables the timeout for the job, but this is considered bad practice, since you run the risk of leaving potential zombie tasks in your cluster.
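As a side note, on Hadoop 2 the canonical name of this property is mapreduce.task.timeout; mapred.task.timeout is a deprecated alias that still works. If you would rather not change the cluster-wide configuration, ImportTsv accepts generic -D options on the command line (your own command already passes -Dimporttsv.columns this way), so a per-job sketch reusing the command from the question would look like:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dmapreduce.task.timeout=2000000 \
  -Dimporttsv.columns=HBASE_ROW_KEY,info:action,timestamp,info:name,info:bank,info:account \
  records /user/root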
Answer 1 (score: 0)
Well, setting mapred.task.timeout to a larger value certainly helps avoid the timeout, but the job still takes a very long time to run. I finally found a more effective way to speed up the MapReduce job and avoid the crash: increase the memory and CPU resources on all nodes.
Add to yarn-site.xml:
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>2</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>4096</value>
</property>
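Note that the NodeManagers (and the ResourceManager) must be restarted for these changes to take effect. To confirm the new capacities were picked up, you can check the ResourceManager UI (http://master:8088, the same host as the tracking URL in the log above) or query the YARN CLI; a quick sketch, where the node ID comes from the first command's output:

yarn node -list
yarn node -status <node-id>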
Add to mapred-site.xml:
<property>
  <name>yarn.app.mapreduce.am.resource.mb</name>
  <value>4096</value>
</property>
<property>
  <name>yarn.app.mapreduce.am.command-opts</name>
  <value>-Xmx3768m</value>
</property>
<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>2</value>
</property>
<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>2</value>
</property>
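For completeness, and as an assumption on my part rather than something from the original setup: the settings above size the application master and the task vcores, but not the memory of the map and reduce containers themselves. A common companion in mapred-site.xml is to set the per-task container memory and keep the JVM heap (-Xmx) at roughly 80% of it, for example:

<!-- assumed companion settings; tune the values to your nodes' capacity -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx3276m</value>
</property>
<property>
  <name>mapreduce.reduce.memory.mb</name>
  <value>4096</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx3276m</value>
</property>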