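I am using Cloudera's HBase-Spark connector to run intensive HBase or BigTable scans. It works fine, but looking at Spark's detailed logs, it appears that the code tries to re-establish a connection to HBase on every call that processes the results of a Scan(), which I do through JavaHBaseContext.foreachPartition().
Am I right that this code re-establishes a connection to HBase every time? If so, how can I rewrite it to make sure the already established connection is reused?
Here is the complete sample code that produces this behavior: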
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.spark.JavaHBaseContext;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;
import java.util.Iterator;
public class Main
{
    public static void main(String args[]) throws Exception
    {
        SparkConf sc = new SparkConf().setAppName(Main.class.toString()).setMaster("local");
        Configuration hBaseConf = HBaseConfiguration.create();
        Connection hBaseConn = ConnectionFactory.createConnection(hBaseConf);
        JavaSparkContext jSPContext = new JavaSparkContext(sc);
        JavaHBaseContext hBaseContext = new JavaHBaseContext(jSPContext, hBaseConf);
        int numTries = 5;
        byte rowKey[] = "ffec939d-bb21-4525-b1ff-f3143faae2".getBytes();
        for(int i = 0; i < numTries; i++)
        {
            Scan s = new Scan(rowKey);
            FilterList fList = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            fList.addFilter(new KeyOnlyFilter());
            fList.addFilter(new FirstKeyOnlyFilter());
            fList.addFilter(new PageFilter(5));
            fList.addFilter(new PrefixFilter(rowKey));
            s.setFilter(fList);
            s.setCaching(5);
            JavaRDD<Tuple2<ImmutableBytesWritable, Result>> scanRDD = hBaseContext
                .hbaseRDD(hBaseConn.getTable(TableName.valueOf("FFUnits")).getName(), s);
            hBaseContext.foreachPartition(scanRDD, new VoidFunction<Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection>>(){
                private static final long serialVersionUID = 1L;
                public void call(Tuple2<Iterator<Tuple2<ImmutableBytesWritable,Result>>, Connection> t) throws Exception{
                    while (t._1().hasNext())
                        System.out.println("\tCurrent row: " + new String(t._1().next()._1.get()));
                }});
        }
    }
}
Here is the output from the Spark logs. This output repeats for each of the 5 iterations of the loop:
18/03/26 15:51:56 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c5f
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c5f closed
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:56 INFO executor.Executor: Finished task 0.0 in stage 3.0 (TID 3). 2044 bytes result sent to driver
18/03/26 15:51:56 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 300 ms on localhost (1/1)
18/03/26 15:51:56 INFO scheduler.DAGScheduler: ResultStage 3 (foreachPartition at HBaseContext.scala:98) finished in 0.301 s
18/03/26 15:51:56 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 3.0, whose tasks have all completed, from pool
18/03/26 15:51:56 INFO scheduler.DAGScheduler: Job 3 finished: foreachPartition at HBaseContext.scala:98, took 0.311925 s
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9 stored as values in memory (estimated size 266.5 KB, free 1391.1 KB)
18/03/26 15:51:56 INFO storage.MemoryStore: Block broadcast_9_piece0 stored as bytes in memory (estimated size 20.7 KB, free 1411.8 KB)
18/03/26 15:51:56 INFO storage.BlockManagerInfo: Added broadcast_9_piece0 in memory on localhost:57171 (size: 20.7 KB, free: 457.8 MB)
18/03/26 15:51:56 INFO spark.SparkContext: Created broadcast 9 from NewHadoopRDD at NewHBaseRDD.scala:25
18/03/26 15:51:56 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0xc412556 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:56 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@6f930e0
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:56 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c60, negotiated timeout = 90000
18/03/26 15:51:56 INFO util.RegionSizeCalculator: Calculating region sizes for table "FFUnits".
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing master protocol: MasterService
18/03/26 15:51:57 INFO client.ConnectionManager$HConnectionImplementation: Closing zookeeper sessionid=0x16261d615db0c60
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Session: 0x16261d615db0c60 closed
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: EventThread shut down
18/03/26 15:51:57 INFO spark.SparkContext: Starting job: foreachPartition at HBaseContext.scala:98
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Got job 4 (foreachPartition at HBaseContext.scala:98) with 1 output partitions
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Final stage: ResultStage 4 (foreachPartition at HBaseContext.scala:98)
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Parents of final stage: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Missing parents: List()
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427), which has no missing parents
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10 stored as values in memory (estimated size 2.9 KB, free 1414.7 KB)
18/03/26 15:51:57 INFO storage.MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 1719.0 B, free 1416.4 KB)
18/03/26 15:51:57 INFO storage.BlockManagerInfo: Added broadcast_10_piece0 in memory on localhost:57171 (size: 1719.0 B, free: 457.8 MB)
18/03/26 15:51:57 INFO spark.SparkContext: Created broadcast 10 from broadcast at DAGScheduler.scala:1006
18/03/26 15:51:57 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 4 (MapPartitionsRDD[9] at map at HBaseContext.scala:427)
18/03/26 15:51:57 INFO scheduler.TaskSchedulerImpl: Adding task set 4.0 with 1 tasks
18/03/26 15:51:57 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 4.0 (TID 4, localhost, partition 0,ANY, 2611 bytes)
18/03/26 15:51:57 INFO executor.Executor: Running task 0.0 in stage 4.0 (TID 4)
18/03/26 15:51:57 INFO spark.NewHBaseRDD: Input split: HBase table split(table name: FFUnits, scan: GiJmZmVjOTM5ZC1iYjIxLTQ1MjUtYjFmZi1mMzE0M2ZhYWUyKqECCilvcmcuYXBhY2hlLmhhZG9v
cC5oYmFzZS5maWx0ZXIuRmlsdGVyTGlzdBLzAQgBEjIKLG9yZy5hcGFjaGUuaGFkb29wLmhiYXNl
LmZpbHRlci5LZXlPbmx5RmlsdGVyEgIIABI1CjFvcmcuYXBhY2hlLmhhZG9vcC5oYmFzZS5maWx0
ZXIuRmlyc3RLZXlPbmx5RmlsdGVyEgASLwopb3JnLmFwYWNoZS5oYWRvb3AuaGJhc2UuZmlsdGVy
LlBhZ2VGaWx0ZXISAggFElMKK29yZy5hcGFjaGUuaGFkb29wLmhiYXNlLmZpbHRlci5QcmVmaXhG
aWx0ZXISJAoiZmZlYzkzOWQtYmIyMS00NTI1LWIxZmYtZjMxNDNmYWFlMjgBQAGIAQU=, start row: ffec939d-bb21-4525-b1ff-f3143faae2, end row: , region location: 144.240.189.35.bc.googleusercontent.com, encoded region name: 2bce3b6bf780755d19fc4b610b17cf11)
18/03/26 15:51:57 INFO zookeeper.RecoverableZooKeeper: Process identifier=hconnection-0x46ac4a0 connecting to ZooKeeper ensemble=hbase-3:2181
18/03/26 15:51:57 INFO zookeeper.ZooKeeper: Initiating client connection, connectString=hbase-3:2181 sessionTimeout=90000 watcher=org.apache.hadoop.hbase.zookeeper.PendingWatcher@5a8a2d2
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Opening socket connection to server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181. Will not attempt to authenticate using SASL (unknown error)
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Socket connection established to 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, initiating session
18/03/26 15:51:57 INFO zookeeper.ClientCnxn: Session establishment complete on server 144.240.189.35.bc.googleusercontent.com/35.189.240.144:2181, sessionid = 0x16261d615db0c61, negotiated timeout = 90000
18/03/26 15:51:57 INFO mapreduce.TableInputFormatBase: Input split length: 4 M bytes.
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0049424a-5cea-46cb-a6b0-7c50d6465588
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*0082054a-b86a-4263-9753-025c1b0607be
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*00e21835-5dc6-4d82-8b8c-a4dcae4f14cd
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*01129620-a599-4fb7-9e2f-3492df1d06a3
Current row: ffec939d-bb21-4525-b1ff-f3143faae246*1*035b3450-e523-4df6-a24f-11ebb29050f7
My hbase-site.xml file looks like this:
<configuration>
    <property>
        <name>hbase.zookeeper.quorum</name>
        <value>hbase-3</value>
    </property>
    <property>
        <name>hbase.zookeeper.property.clientPort</name>
        <value>2181</value>
    </property>
    <property>
        <name>timeout</name>
        <value>5000</value>
    </property>
</configuration>
I am using the following versions:
Spark v 1.6.2
HBase 1.3.1
Spark-HBase 1.2.0-cdh5.14.0
Thanks for any help and advice!
Answer 0 (score: 2)
This is a common problem. The cost of creating a connection can dwarf the actual work you are doing.
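In Cloud Bigtable, you can set google.bigtable.use.cached.data.channel.pool to true in your configuration settings. This will significantly improve performance. Cloud Bigtable ultimately uses a single HTTP/2 endpoint for all of your Cloud Bigtable instances.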
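As a rough sketch of what that setting could look like, assuming the Cloud Bigtable HBase client reads its options from the same hbase-site.xml shown in the question (the property name comes from this answer; placing it in that file is an assumption):

```xml
<!-- Assumed placement: inside the existing <configuration> element of hbase-site.xml -->
<property>
    <name>google.bigtable.use.cached.data.channel.pool</name>
    <value>true</value>
</property>
```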
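I don't know of a similar construct in HBase, but one way to do this would be to create an implementation of Connection that creates a single cached Connection. You would have to set hbase.client.connection.impl to your new class.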
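The caching idea itself can be sketched without any HBase classes. The following is a minimal, hypothetical illustration, not the hbase.client.connection.impl implementation described above: SharedConnections is an invented name, and it uses AutoCloseable as a stand-in for a real connection type. It opens at most one connection per key and hands the same instance back on later calls.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache that opens at most one connection per key. */
final class SharedConnections<C extends AutoCloseable> {
    private final Map<String, C> cache = new ConcurrentHashMap<>();
    private final Function<String, C> opener;

    /** opener is the expensive "create a connection" step (an assumption of this sketch). */
    SharedConnections(Function<String, C> opener) {
        this.opener = opener;
    }

    /** Returns the cached connection for key, opening it only on first use. */
    C get(String key) {
        // computeIfAbsent runs the opener at most once per key, even under contention.
        return cache.computeIfAbsent(key, opener);
    }

    /** Closes every cached connection and empties the cache. */
    void closeAll() {
        RuntimeException failure = null;
        for (C connection : cache.values()) {
            try {
                connection.close();
            } catch (Exception e) {
                // Keep closing the remaining connections; report all failures at the end.
                if (failure == null) {
                    failure = new RuntimeException("failed to close a cached connection");
                }
                failure.addSuppressed(e);
            }
        }
        cache.clear();
        if (failure != null) {
            throw failure;
        }
    }
}
```

In a real job, the opener would presumably wrap ConnectionFactory.createConnection from the question's code, and the cache would be held in a static field so that tasks running in the same executor JVM share it. Whether the connector can be made to use such a cache depends on the connector version.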