I am currently writing a Python script that uses happybase to export an HBase table to CSV. The problem I am running into is that if the table is too large, I get the following error after reading two million or so rows:
Hbase_thrift.IOError: IOError(message='org.apache.hadoop.hbase.DoNotRetryIOException: hconnection-0x8dfa2f2 closed\n\tat org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1182)\n\tat org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)\n\tat org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)\n\tat org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)\n\tat org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:212)\n\tat org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:314)\n\tat org.apache.hadoop.hbase.client.ClientScanner.loadCache(ClientScanner.java:432)\n\tat org.apache.hadoop.hbase.client.ClientScanner.next(ClientScanner.java:358)\n\tat org.apache.hadoop.hbase.client.AbstractClientScanner.next(AbstractClientScanner.java:70)\n\tat org.apache.hadoop.hbase.thrift.ThriftServerRunner$HBaseHandler.scannerGetList(ThriftServerRunner.java:1423)\n\tat sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)\n\tat sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.lang.reflect.Method.invoke(Method.java:498)\n\tat org.apache.hadoop.hbase.thrift.HbaseHandlerMetricsProxy.invoke(HbaseHandlerMetricsProxy.java:67)\n\tat com.sun.proxy.$Proxy10.scannerGetList(Unknown Source)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4789)\n\tat org.apache.hadoop.hbase.thrift.generated.Hbase$Processor$scannerGetList.getResult(Hbase.java:4773)\n\tat org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)\n\tat org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)\n\tat org.apache.hadoop.hbase.thrift.TBoundedThreadPoolServer$ClientConnnection.run(TBoundedThreadPoolServer.java:289)\n\tat 
org.apache.hadoop.hbase.thrift.CallQueue$Call.run(CallQueue.java:64)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n')
My plan is to split the single scan loop into chunks (i.e. open an HBase connection -> fetch the first 100,000 rows -> close the connection -> reopen it -> fetch the next 100,000 rows -> close it... and so on), but I cannot figure out how to do that. Here is a sample of my code, which reads all rows and crashes:
import happybase

connection = happybase.Connection('localhost')
table = 'some_table'
table_object = connection.table(table)

for row in table_object.scan():
    print(row)
Any help would be appreciated (even if you suggest a different solution :))
Thanks
Answer 0 (score: 0)
Actually, I found a way to do it, which is as follows:
import happybase

connection = happybase.Connection('localhost')
table = 'some_table'
table_object = connection.table(table)

while True:
    try:
        for row in table_object.scan():
            print(row)
        break  # the scan finished without an error, so we are done
    except Exception as e:
        # e.message no longer exists in Python 3; inspect str(e) instead
        if "org.apache.hadoop.hbase.DoNotRetryIOException" in str(e):
            # The Thrift-side connection was closed: reopen it and retry.
            # Note that this restarts the scan from the first row, so
            # already-read rows are emitted again.
            connection.open()
        else:
            print(e)
            quit()
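The retry loop above restarts the scan from the beginning after every reconnect. A sketch of the chunked approach described in the question, which resumes where it left off, is below. The names `scan_in_chunks` and `make_connection` are hypothetical helpers (not part of happybase); the sketch relies on the facts that HBase scans return row keys in sorted order and that happybase's `Table.scan(row_start=...)` parameter is inclusive of the start key.

```python
def scan_in_chunks(make_connection, table_name, chunk_size=100_000):
    """Yield (row_key, data) pairs from an HBase table, using a fresh
    Thrift connection for every `chunk_size` rows.

    `make_connection` is a zero-argument callable returning a new
    happybase.Connection (hypothetical helper supplied by the caller).
    """
    last_key = None
    while True:
        conn = make_connection()
        try:
            table = conn.table(table_name)
            count = 0
            # Resume from the last key we saw; `row_start` is inclusive,
            # so the first row of a resumed scan is a duplicate we skip.
            for key, data in table.scan(row_start=last_key):
                if key == last_key:
                    continue
                yield key, data
                last_key = key
                count += 1
                if count >= chunk_size:
                    break  # drop this connection and open a fresh one
            else:
                return  # scanner exhausted without hitting the cap: done
        finally:
            conn.close()
```

Usage would look something like `for key, data in scan_in_chunks(lambda: happybase.Connection('localhost'), 'some_table'): ...`. Independently of chunking, passing `batch_size` to `scan()` is also worth experimenting with, since it controls how many rows each Thrift round trip fetches.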