I have a Storm topology running against Hadoop configured in pseudo-distributed mode. The topology contains a bolt that has to write data to HBase.
My first approach, for testing purposes, was to create (and close) the connection and write the data right inside my bolt's execute method. But it looks like my local machine doesn't have enough resources to handle all the requests going to HBase. After roughly 30 successfully processed requests I see the following in the Storm worker logs:
o.a.z.ClientCnxn [INFO] Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
o.a.z.ClientCnxn [INFO] Socket connection established to localhost/127.0.0.1:2181, initiating session
o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
o.a.h.h.z.RecoverableZooKeeper [WARN] Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid
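For context, what I was doing amounts to roughly the following sketch (simplified; the table, column family, and field names are placeholders), i.e. a brand-new connection, and therefore a new ZooKeeper session, is created and torn down for every single tuple:
// simplified sketch of the per-tuple approach described above (names are placeholders)
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    try {
        Configuration config = HBaseConfiguration.create();
        // a new HConnection means a new ZooKeeper session for every tuple
        HConnection connection = HConnectionManager.createConnection(config);
        HTableInterface table = connection.getTable("my_table");
        try {
            Put p = new Put(tuple.getStringByField("rowkey").getBytes());
            p.add("cf".getBytes(), "col".getBytes(), tuple.getBinaryByField("value"));
            table.put(p);
        } finally {
            table.close();
            connection.close();
        }
    } catch (IOException e) {
        collector.reportError(e);
    }
}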
My idea was to reduce the number of connections to HBase by creating a single connection per bolt instance: open the connection in the prepare method and close it in cleanup. But according to the documentation, cleanup is not guaranteed to be called in distributed mode.
After that I found the framework for working with HBase from Storm - storm-hbase. Unfortunately there is almost no information about it, just the README in its GitHub repo.
Additionally, I need to be able to delete cells from an HBase table, but I found nothing about that in the storm-hbase docs.
Thanks in advance!
Answer 0 (score: 2)
Oh boy, my time to shine! I've had to do a ton of optimization of writes to HBase from Storm, so hopefully this helps you.
If you're just getting started, storm-hbase is a great way to begin streaming data into HBase. You can clone the project, do a maven install, and then reference it in your topology.
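For reference, wiring it up looks roughly like the sketch below, loosely following its README (the spout, field names, and table name are placeholders, not something from your topology):
// sketch based on the storm-hbase README; WordSpout, field names and table name are placeholders
Map<String, Object> hbConf = new HashMap<String, Object>();

Config conf = new Config();
conf.put("hbase.conf", hbConf);

SimpleHBaseMapper mapper = new SimpleHBaseMapper()
        .withRowKeyField("word")
        .withColumnFields(new Fields("word"))
        .withCounterFields(new Fields("count"))
        .withColumnFamily("cf");

HBaseBolt hbaseBolt = new HBaseBolt("WordCount", mapper)
        .withConfigKey("hbase.conf");

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-spout", new WordSpout(), 1);
builder.setBolt("hbase-bolt", hbaseBolt, 1).fieldsGrouping("word-spout", new Fields("word"));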
However, once your logic gets more complicated, creating your own classes to talk to HBase is probably the way to go, and that's what I'll cover in this answer.
I'm assuming you're using maven and the maven-shade plugin. You'll need to reference hbase-client:
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>${hbase.version}</version>
</dependency>
Also make sure to package hbase-site.xml in your topology jar. You can download this file from your cluster and just put it in src/main/resources. I also keep one for testing in development named hbase-site.dev.xml. Then just use the shade plugin to move it to the root of the jar.
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.4</version>
    <configuration>
        <createDependencyReducedPom>true</createDependencyReducedPom>
        <artifactSet>
            <excludes>
                <exclude>classworlds:classworlds</exclude>
                <exclude>junit:junit</exclude>
                <exclude>jmock:*</exclude>
                <exclude>*:xml-apis</exclude>
                <exclude>org.apache.maven:lib:tests</exclude>
                <exclude>log4j:log4j:jar:</exclude>
                <exclude>org.testng:testng</exclude>
            </excludes>
        </artifactSet>
    </configuration>
    <executions>
        <execution>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.IncludeResourceTransformer">
                        <resource>core-site.xml</resource>
                        <file>src/main/resources/core-site.xml</file>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.IncludeResourceTransformer">
                        <resource>hbase-site.xml</resource>
                        <file>src/main/resources/hbase-site.xml</file>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.IncludeResourceTransformer">
                        <resource>hdfs-site.xml</resource>
                        <file>src/main/resources/hdfs-site.xml</file>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass></mainClass>
                    </transformer>
                </transformers>
                <filters>
                    <filter>
                        <artifact>*:*</artifact>
                        <excludes>
                            <exclude>META-INF/*.SF</exclude>
                            <exclude>META-INF/*.DSA</exclude>
                            <exclude>META-INF/*.RSA</exclude>
                            <exclude>junit/*</exclude>
                            <exclude>webapps/</exclude>
                            <exclude>testng*</exclude>
                            <exclude>*.js</exclude>
                            <exclude>*.png</exclude>
                            <exclude>*.css</exclude>
                            <exclude>*.json</exclude>
                            <exclude>*.csv</exclude>
                        </excludes>
                    </filter>
                </filters>
            </configuration>
        </execution>
    </executions>
</plugin>
Note: there are lines in there for other configs I use, so remove the ones you don't need. By the way, I'm not really fond of packaging configs like this, but it makes setting up the HBase connection much easier and it solves a bunch of weird connection errors.
Update 3/19/2018: the HBase API has changed significantly since I wrote this answer, but the concepts are the same.
The most important thing is to create a single HConnection per bolt instance in the prepare method, and then re-use that connection for the lifetime of the bolt!
Configuration config = HBaseConfiguration.create();
connection = HConnectionManager.createConnection(config);
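(Per the 2018 update above: HConnectionManager and HConnection were later removed from hbase-client. On a recent version, the same one-connection-per-bolt idea looks roughly like the sketch below; the table name is just an example.)
// rough modern equivalent with the newer hbase-client API
Configuration config = HBaseConfiguration.create();
Connection connection = ConnectionFactory.createConnection(config); // heavyweight: create once in prepare()
Table table = connection.getTable(TableName.valueOf("fruit"));      // lightweight: fine to get per operation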
To start out, you can do single PUTs to HBase. With this approach you open and close the table for every put.
// single put method
private HConnection connection;

@SuppressWarnings("rawtypes")
@Override
public void prepare(java.util.Map stormConf, backtype.storm.task.TopologyContext context) {
    Configuration config = HBaseConfiguration.create();
    connection = HConnectionManager.createConnection(config);
}

@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    try {
        // do stuff
        // call putFruit
    } catch (Exception e) {
        LOG.error("bolt error", e);
        collector.reportError(e);
    }
}

// example put method you'd call from within execute somewhere
private void putFruit(String key, FruitResult data) throws IOException {
    HTableInterface table = connection.getTable(Constants.TABLE_FRUIT);
    try {
        Put p = new Put(key.getBytes());
        long ts = data.getTimestamp();
        p.add(Constants.FRUIT_FAMILY, Constants.COLOR, ts, data.getColor().getBytes());
        p.add(Constants.FRUIT_FAMILY, Constants.SIZE, ts, data.getSize().getBytes());
        p.add(Constants.FRUIT_FAMILY, Constants.WEIGHT, ts, Bytes.toBytes(data.getWeight()));
        table.put(p);
    } finally {
        try {
            table.close();
        } finally {
            // nothing
        }
    }
}
Notice that I'm re-using the connection here. I recommend starting this way because it's easier to get working and to debug. Eventually it won't scale because of the number of requests you're trying to send over the network, and you'll need to start batching multiple PUTs together.
In order to batch the PUTs, you need to open a table with your HConnection and keep it open. You also need to set Auto Flush to false. The table will then buffer requests automatically until it reaches the "hbase.client.write.buffer" size (default is 2097152 bytes).
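If the default 2 MB buffer is too small for your write rate, you can raise it on the Configuration you build in prepare (the 8 MB value below is just an example):
Configuration config = HBaseConfiguration.create();
// buffer roughly 8 MB of Puts client-side before flushing (default hbase.client.write.buffer is 2097152 bytes)
config.setLong("hbase.client.write.buffer", 8L * 1024 * 1024);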
// batch put method
private static boolean AUTO_FLUSH = false;
private static boolean CLEAR_BUFFER_ON_FAIL = false;
private HConnection connection;
private HTableInterface fruitTable;

@SuppressWarnings("rawtypes")
@Override
public void prepare(java.util.Map stormConf, backtype.storm.task.TopologyContext context) {
    Configuration config = HBaseConfiguration.create();
    connection = HConnectionManager.createConnection(config);
    fruitTable = connection.getTable(Constants.TABLE_FRUIT);
    fruitTable.setAutoFlush(AUTO_FLUSH, CLEAR_BUFFER_ON_FAIL);
}

@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
    try {
        // do stuff
        // call putFruit
    } catch (Exception e) {
        LOG.error("bolt error", e);
        collector.reportError(e);
    }
}

// example put method you'd call from within execute somewhere
private void putFruit(String key, FruitResult data) throws IOException {
    Put p = new Put(key.getBytes());
    long ts = data.getTimestamp();
    p.add(Constants.FRUIT_FAMILY, Constants.COLOR, ts, data.getColor().getBytes());
    p.add(Constants.FRUIT_FAMILY, Constants.SIZE, ts, data.getSize().getBytes());
    p.add(Constants.FRUIT_FAMILY, Constants.WEIGHT, ts, Bytes.toBytes(data.getWeight()));
    fruitTable.put(p);
}
With either approach, it's still a good idea to try to close the HBase connection in cleanup. Just be aware that it may not get called before your worker is killed.
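A minimal cleanup sketch for the batched variant above (assuming the fruitTable and connection fields from that example; flushCommits() pushes whatever is still sitting in the client-side buffer before closing):
@Override
public void cleanup() {
    try {
        if (fruitTable != null) {
            fruitTable.flushCommits(); // write out any buffered Puts
            fruitTable.close();
        }
        if (connection != null) {
            connection.close();
        }
    } catch (IOException e) {
        LOG.error("error closing HBase connection", e);
    }
}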
To delete a cell it works the same way, just use new Delete(key); instead of a Put. Let me know if you have more questions.
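And since you asked about deleting cells, here's a small sketch in the same style as putFruit (the deleteFruit name and the column constants are illustrative, using the same old HTableInterface API):
// example delete method, mirroring putFruit above
private void deleteFruit(String key) throws IOException {
    HTableInterface table = connection.getTable(Constants.TABLE_FRUIT);
    try {
        Delete d = new Delete(key.getBytes());
        // delete just this one column's cells; omit this line to delete the whole row
        d.deleteColumns(Constants.FRUIT_FAMILY, Constants.COLOR);
        table.delete(d);
    } finally {
        table.close();
    }
}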
Answer 1 (score: 1)
For example, you could use a "publisher" thread. That is: have a separate class that runs as a thread and executes the hbase/mysql/elasticsearch/hdfs/etc. requests for you. For performance reasons this should be done in batches.
Have a global queue to handle the concurrent operations, plus an executor service:
private transient BlockingQueue<Tuple> insertQueue;
private transient ExecutorService theExecutor;
private transient Future<?> publisherFuture;
Have a thread class that will do the inserts of the documents for you:
private class Publisher implements Runnable {

    @Override
    public void run() {
        long sendBatchTs = System.currentTimeMillis();
        while (true) {
            if (insertQueue.size() > 100) { // 100 tuples per batch
                List<Tuple> batchQueue = new ArrayList<>(100);
                insertQueue.drainTo(batchQueue, 100);
                // write code to insert the 100 documents
                sendBatchTs = System.currentTimeMillis();
            } else if (System.currentTimeMillis() - sendBatchTs > 5000) {
                // flush whatever is queued, to prevent tuple timeouts
                int listSize = insertQueue.size();
                List<Tuple> batchQueue = new ArrayList<>(listSize);
                insertQueue.drainTo(batchQueue, listSize);
                // write code to insert the remaining documents
                sendBatchTs = System.currentTimeMillis();
            }
            // your code
            // note: this loop spins; a short sleep or a timed poll() would be kinder to the CPU
        }
    }
}
Initialize the thread class and the queue in the prepare method:
@Override
public void prepare(final Map _conf, final TopologyContext _context, final OutputCollector _collector) {
    // open your connection
    insertQueue = new LinkedBlockingQueue<>();
    theExecutor = Executors.newSingleThreadExecutor();
    publisherFuture = theExecutor.submit(new Publisher());
}
Collect the tuples into the queue in your execute method (a small sketch of this is shown after the cleanup code below). Finally, shut the publisher down and close the connection in cleanup:
@Override
public void cleanup() {
    super.cleanup();
    theExecutor.shutdown();
    publisherFuture.cancel(true);
    // close your connection
}
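For completeness, the execute side of this pattern can be as small as handing the tuple to the queue (a sketch; how and when you ack is up to you):
@Override
public void execute(Tuple tuple) {
    // hand the tuple off to the publisher thread; it goes out with the next batch
    insertQueue.offer(tuple);
    // note: acking here means a crash can lose buffered tuples; acking from the
    // publisher after a successful batch write is the safer option
}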