Writing to HBase from an Apache Storm bolt

Date: 2015-08-20 20:52:03

Tags: hbase apache-storm

I have a Storm topology running on Hadoop configured in pseudo-distributed mode. The topology contains a bolt that has to write data to HBase. My first approach, for testing purposes, was to create (and close) the connection and write the data right inside the bolt's execute method. But it looks like my local machine doesn't have enough resources to handle all the requests going to HBase. After roughly 30 successfully processed requests I see the following in the Storm worker logs:
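
Roughly, that first per-tuple approach presumably looked something like the sketch below (the table, column family, and field names are made up for illustration); every execute call pays the full ZooKeeper and connection setup cost:

// Sketch of the per-tuple approach described above (illustrative names only)
@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
   try {
      Configuration config = HBaseConfiguration.create();
      HConnection connection = HConnectionManager.createConnection(config); // new connection per tuple
      HTableInterface table = connection.getTable("my_table");
      try {
         Put p = new Put(tuple.getStringByField("rowkey").getBytes());
         p.add("cf".getBytes(), "col".getBytes(), tuple.getStringByField("value").getBytes());
         table.put(p);
      } finally {
         table.close();
         connection.close();
      }
   } catch (IOException e) {
      collector.reportError(e);
   }
}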

o.a.z.ClientCnxn [INFO] Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
o.a.z.ClientCnxn [INFO] Socket connection established to localhost/127.0.0.1:2181, initiating session
o.a.z.ClientCnxn [INFO] Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
o.a.h.h.z.RecoverableZooKeeper [WARN] Possibly transient ZooKeeper, quorum=localhost:2181, exception=org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/hbaseid

My idea was to reduce the number of connections to HBase by creating a single connection per bolt instance: open the connection in the prepare method and close it in cleanup. But according to the documentation, cleanup is not guaranteed to be called in distributed mode.

After that I found a framework for integrating Storm with HBase, storm-hbase. Unfortunately there is almost no information about it, just the README in its GitHub repo.

  1. So my first question is: is storm-hbase a good solution for Storm-HBase integration? What is the best way to do it?
  2. Also, I need to be able to delete cells from an HBase table, but I didn't find anything about that in the storm-hbase docs.

    1. Is it possible to do this with storm-hbase? Or, going back to the previous question, is there another way to do all of this?

Thanks in advance!

2 answers:

Answer 0 (score: 2)

Oh boy, my time to shine! I've had to do a ton of optimizing of writes to HBase from Storm, so hopefully this helps you.

If you're just getting started, storm-hbase is a great way to start streaming data into HBase. You can clone the project, do a maven install, and then reference it in your topology.
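
For reference, the storm-hbase README wires a bolt up roughly like the sketch below. This is based on that README rather than on code from this answer; WordSpout, the field names, and the "WordCount" table are placeholders, and the exact package and builder API may differ between storm-hbase versions:

// Sketch: declaring an HBaseBolt via storm-hbase (names are illustrative)
SimpleHBaseMapper mapper = new SimpleHBaseMapper()
        .withRowKeyField("word")               // tuple field used as the HBase row key
        .withColumnFields(new Fields("word"))  // tuple fields written as plain columns
        .withColumnFamily("cf");               // target column family

HBaseBolt hbaseBolt = new HBaseBolt("WordCount", mapper)
        .withConfigKey("hbase.conf");          // topology config key holding the HBase client settings

TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("word-spout", new WordSpout(), 1);
builder.setBolt("hbase-bolt", hbaseBolt, 1).fieldsGrouping("word-spout", new Fields("word"));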

However, if you start getting into more complicated logic, then creating your own classes to talk to HBase is probably the way to go. That's what I'll cover in the rest of this answer.

Project setup

I'm assuming you're using maven and the maven-shade plugin. You'll need to reference hbase-client:

<dependency>
   <groupId>org.apache.hbase</groupId>
   <artifactId>hbase-client</artifactId>
   <version>${hbase.version}</version>
</dependency>

Also make sure to package hbase-site.xml in your topology jar. You can download this file from your cluster and just put it in src/main/resources. I also have one for testing in development named hbase-site.dev.xml. Then just use the shade plugin to move it to the root of the jar.

<plugin>
   <groupId>org.apache.maven.plugins</groupId>
   <artifactId>maven-shade-plugin</artifactId>
   <version>2.4</version>
   <configuration>
      <createDependencyReducedPom>true</createDependencyReducedPom>
      <artifactSet>
         <excludes>
            <exclude>classworlds:classworlds</exclude>
            <exclude>junit:junit</exclude>
            <exclude>jmock:*</exclude>
            <exclude>*:xml-apis</exclude>
            <exclude>org.apache.maven:lib:tests</exclude>
            <exclude>log4j:log4j:jar:</exclude>
            <exclude>org.testng:testng</exclude>
         </excludes>
      </artifactSet>
   </configuration>
   <executions>
      <execution>
         <phase>package</phase>
         <goals>
            <goal>shade</goal>
         </goals>
         <configuration>
            <transformers>
               <transformer implementation="org.apache.maven.plugins.shade.resource.IncludeResourceTransformer">
                  <resource>core-site.xml</resource>
                  <file>src/main/resources/core-site.xml</file>
               </transformer>
               <transformer implementation="org.apache.maven.plugins.shade.resource.IncludeResourceTransformer">
                  <resource>hbase-site.xml</resource>
                  <file>src/main/resources/hbase-site.xml</file>
               </transformer>
               <transformer implementation="org.apache.maven.plugins.shade.resource.IncludeResourceTransformer">
                  <resource>hdfs-site.xml</resource>
                  <file>src/main/resources/hdfs-site.xml</file>
               </transformer>
               <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
               <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                  <mainClass></mainClass>
               </transformer>
            </transformers>
            <filters>
               <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                     <exclude>META-INF/*.SF</exclude>
                     <exclude>META-INF/*.DSA</exclude>
                     <exclude>META-INF/*.RSA</exclude>
                     <exclude>junit/*</exclude>
                     <exclude>webapps/</exclude>
                     <exclude>testng*</exclude>
                     <exclude>*.js</exclude>
                     <exclude>*.png</exclude>
                     <exclude>*.css</exclude>
                     <exclude>*.json</exclude>
                     <exclude>*.csv</exclude>
                  </excludes>
               </filter>
            </filters>
         </configuration>
      </execution>
   </executions>
</plugin>

Note: the plugin configuration above includes lines for other configs I use; remove them if you don't need them. By the way, I don't really like packaging configs like this, but it makes setting up the HBase connection much easier and fixes a bunch of weird connection errors.

Managing HBase connections in Storm

Update 3/19/2018: the HBase API has changed significantly since I wrote this answer, but the concepts are the same.

The most important thing is to create one HConnection per bolt instance in the prepare method, and then re-use that connection for the lifetime of the bolt!

Configuration config = HBaseConfiguration.create();
connection = HConnectionManager.createConnection(config);

To start out, you can do single PUTs to HBase. With this approach you open and close the table for every put.

// single put method
private HConnection connection;

@SuppressWarnings("rawtypes")
@Override
public void prepare(java.util.Map stormConf, backtype.storm.task.TopologyContext context) {
   Configuration config = HBaseConfiguration.create();
   connection = HConnectionManager.createConnection(config);
}

@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
   try {
      // do stuff
      // call putFruit
   } catch (Exception e) {
      LOG.error("bolt error", e);
      collector.reportError(e);
   }
}

// example put method you'd call from within execute somewhere
private void putFruit(String key, FruitResult data) throws IOException {
   HTableInterface table = connection.getTable(Constants.TABLE_FRUIT);
   try {
     Put p = new Put(key.getBytes());
        long ts = data.getTimestamp();
        p.add(Constants.FRUIT_FAMILY, Constants.COLOR, ts, data.getColor().getBytes());
        p.add(Constants.FRUIT_FAMILY, Constants.SIZE, ts, data.getSize().getBytes());
        p.add(Constants.FRUIT_FAMILY, Constants.WEIGHT, ts, Bytes.toBytes(data.getWeight()));
        table.put(p);
   } finally {
      try {
         table.close();
      } finally {
         // nothing
      }
   }
}

Notice that I re-use the connection here. I suggest starting with this approach because it's easier to get working and to debug. Eventually it will stop scaling because of the number of requests you're trying to send across the network, and you'll need to start batching multiple PUTs together.

In order to batch PUTs, you need to open a table with your HConnection and keep it open. You also need to set auto flush to false. This means the table will automatically buffer requests until it reaches the "hbase.client.write.buffer" size (default 2097152 bytes).

// batch put method
private static boolean AUTO_FLUSH = false;
private static boolean CLEAR_BUFFER_ON_FAIL = false;
private HConnection connection;
private HTableInterface fruitTable;

@SuppressWarnings("rawtypes")
@Override
public void prepare(java.util.Map stormConf, backtype.storm.task.TopologyContext context) {
   Configuration config = HBaseConfiguration.create();
   connection = HConnectionManager.createConnection(config);
   fruitTable = connection.getTable(Constants.TABLE_FRUIT);
   fruitTable.setAutoFlush(AUTO_FLUSH, CLEAR_BUFFER_ON_FAIL);
}

@Override
public void execute(Tuple tuple, BasicOutputCollector collector) {
   try {
      // do stuff
      // call putFruit
   } catch (Exception e) {
      LOG.error("bolt error", e);
      collector.reportError(e);
   }
}

// example put method you'd call from within execute somewhere
private void putFruit(String key, FruitResult data) throws IOException {
   Put p = new Put(key.getBytes());
   long ts = data.getTimestamp();
   p.add(Constants.FRUIT_FAMILY, Constants.COLOR, ts, data.getColor().getBytes());
   p.add(Constants.FRUIT_FAMILY, Constants.SIZE, ts, data.getSize().getBytes());
   p.add(Constants.FRUIT_FAMILY, Constants.WEIGHT, ts, Bytes.toBytes(data.getWeight()));
   fruitTable.put(p);
}
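
If you ever need to push the buffered puts out before the write buffer fills (for example on a timer or a tick tuple), the old client API also lets you flush the table explicitly. A minimal sketch, assuming the fruitTable field above (flushFruit is a hypothetical helper name):

// optional: force out whatever is sitting in the client-side write buffer
private void flushFruit() throws IOException {
   fruitTable.flushCommits();
}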

With either approach, it's still a good idea to try to close your HBase connection in cleanup. Just be aware that it might not get called before your worker is killed.
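
A minimal sketch of that cleanup, assuming the connection (and, for the batch version, fruitTable) fields from the snippets above:

@Override
public void cleanup() {
   try {
      if (fruitTable != null) {
         fruitTable.close();   // closing the table also flushes any buffered puts
      }
      if (connection != null) {
         connection.close();
      }
   } catch (IOException e) {
      LOG.error("error closing HBase connection", e);
   }
}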

Other things

  • To do deletes, just use new Delete(key); instead of a Put (a minimal sketch follows below).
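
A minimal delete sketch, reusing the connection and constants from the put examples above (deleteFruitColor is a hypothetical helper; drop the deleteColumns call to remove the whole row instead of a single cell):

// example delete method, analogous to putFruit
private void deleteFruitColor(String key) throws IOException {
   HTableInterface table = connection.getTable(Constants.TABLE_FRUIT);
   try {
      Delete d = new Delete(key.getBytes());
      d.deleteColumns(Constants.FRUIT_FAMILY, Constants.COLOR); // delete just this cell (all versions)
      table.delete(d);
   } finally {
      table.close();
   }
}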

Let me know if you have any more questions.

Answer 1 (score: 1)

How about using a "publisher" thread, for example?

That is: have a separate class that runs as a thread and performs the requests to hbase/mysql/elasticsearch/hdfs/etc. for you. For performance reasons it should do this in batches.

  1. Have a global list to handle concurrent operations and an executor service:

    private transient BlockingQueue<Tuple> insertQueue;
    private transient ExecutorService theExecutor;
    private transient Future<?> publisherFuture;

  2. Have a thread class that will do the inserts for you:

    private class Publisher implements Runnable {

       @Override
       public void run() {
          long sendBatchTs = System.currentTimeMillis();

          while (true) {
             if (insertQueue.size() > 100) { // 100 tuples per batch
                List<Tuple> batchQueue = new ArrayList<>(100);
                insertQueue.drainTo(batchQueue, 100);
                // write code to insert the batch of 100 documents
                sendBatchTs = System.currentTimeMillis();
             } else if (System.currentTimeMillis() - sendBatchTs > 5000) {
                // flush whatever is queued to prevent tuple timeouts
                int listSize = insertQueue.size();
                List<Tuple> batchQueue = new ArrayList<>(listSize);
                insertQueue.drainTo(batchQueue, listSize);
                // write code to insert the smaller batch of documents
                sendBatchTs = System.currentTimeMillis();
             } else {
                try {
                   Thread.sleep(50); // avoid busy-spinning while the queue fills up
                } catch (InterruptedException e) {
                   return; // publisherFuture.cancel(true) interrupts this thread on cleanup
                }
             }
          }
       }
    }
    
  3. Initialize the thread class and the list in the prepare method:

    @Override
    public void prepare(final Map _conf, final TopologyContext _context, final OutputCollector _collector) {
       // open your connection

       insertQueue = new LinkedBlockingQueue<>();
       theExecutor = Executors.newSingleThreadExecutor();
       publisherFuture = theExecutor.submit(new Publisher());
    }
    
  4. Close the connection on cleanup:

    @Override
    public void cleanup() {
       super.cleanup();

       theExecutor.shutdown();
       publisherFuture.cancel(true);
       // close your connection
    }

  5. Collect the tuples in your execute method and add them to the queue.
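
A minimal sketch of step 5, assuming the insertQueue from step 1 and that the OutputCollector passed to prepare is kept in a _collector field; whether you ack here or only after the publisher thread has written the batch depends on the reliability guarantees you need:

@Override
public void execute(Tuple tuple) {
   insertQueue.offer(tuple);   // hand the tuple to the publisher thread
   _collector.ack(tuple);      // or defer the ack until the batch write succeeds
}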