Spark Streaming custom MongoDB receiver

Time: 2020-01-07 14:08:38

Tags: apache-spark receiver

I am writing a custom Spark Streaming receiver for MongoDB so that I can read data from a MongoDB collection through a Spark DStream.

Here is the code I wrote:

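// Imports inferred from usage. MongoDefaultConnector, MongoOptions and
// getPickObservable are the author's own helpers and are not shown here.
import scala.reflect.ClassTag
import scala.util.{Failure, Success}

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.mongodb.scala.{Observer, Subscription}
import org.slf4j.LoggerFactory

class MongoDBReceiver[D: ClassTag](mongoConnector: MongoDefaultConnector,
                                   findOptions: MongoOptions,
                                   storageLevel: StorageLevel) extends Receiver[D](storageLevel) {

  val logger = LoggerFactory.getLogger(getClass)

  // Written by the driver's callback thread, read by the thread that calls onStop().
  @volatile private var subscription: Option[Subscription] = None

  // onStart() must not block, so the actual work runs on its own thread.
  override def onStart(): Unit = {
    new Thread() {
      override def run(): Unit = {
        logger.info("starting")
        receive()
      }
    }.start()
  }

  def receive(): Unit = {
    mongoConnector.getCollection[D]() match {
      case Success(collection) =>
        getPickObservable(collection, findOptions).snapshot(true).subscribe(
          new Observer[D] {
            override def onSubscribe(sub: Subscription): Unit = {
              subscription = Some(sub)
              sub.request(Long.MaxValue) // ask for every document, no backpressure
            }

            // Hand each document over to Spark.
            override def onNext(doc: D): Unit = store(doc)

            override def onError(throwable: Throwable): Unit = stop("Observable errored", throwable)

            // Called once the find query has returned its last document.
            override def onComplete(): Unit = stop("publisher finished")
          }
        )
      case Failure(ex) => stop("Failed to connect to MongoDB", ex)
    }
  }

  override def onStop(): Unit = {
    logger.info("stopping")
    // Release the driver subscription, if one was ever created.
    subscription.foreach(_.unsubscribe())
  }
}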

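This works, but the job reads the same documents several times. According to the logs, the receiver keeps starting and stopping, so it repeats the same processing over and over. Here are the logs I get: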

20/01/07 15:06:21 INFO MongoDBReceiver: starting
20/01/07 15:06:21 INFO cluster: Cluster created with settings {hosts=[sitewhere-mongodb-rd.gfxiq.prv:27017], mode=SINGLE, requiredClusterType=UNKNOWN, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500}
20/01/07 15:06:21 INFO cluster: No server chosen by com.mongodb.async.client.ClientSessionHelper$1@72859317 from cluster description ClusterDescription{type=UNKNOWN, connectionMode=SINGLE, serverDescriptions=[ServerDescription{address=sitewhere-mongodb-rd.gfxiq.prv:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
20/01/07 15:06:21 INFO connection: Opened connection [connectionId{localValue:65, serverValue:14408}] to sitewhere-mongodb-rd.gfxiq.prv:27017
20/01/07 15:06:21 INFO cluster: Monitor thread successfully connected to server with description ServerDescription{address=sitewhere-mongodb-rd.gfxiq.prv:27017, type=STANDALONE, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 4, 9]}, minWireVersion=0, maxWireVersion=5, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=12612869}
20/01/07 15:06:21 INFO connection: Opened connection [connectionId{localValue:66, serverValue:14409}] to sitewhere-mongodb-rd.gfxiq.prv:27017
20/01/07 15:06:21 INFO ReceiverSupervisorImpl: Stopping receiver with message: publisher finished:
20/01/07 15:06:21 INFO MongoDBReceiver: stopping
20/01/07 15:06:21 INFO ReceiverSupervisorImpl: Called receiver onStop
20/01/07 15:06:21 INFO ReceiverSupervisorImpl: Deregistering receiver 0
20/01/07 15:06:21 ERROR ReceiverTracker: Deregistered receiver for stream 0: publisher finished
20/01/07 15:06:21 INFO ReceiverSupervisorImpl: Stopped receiver 0

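Do you know how to fix this so that the connector reads each document only once and stays connected for as long as the connection to the MongoDB server is up?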
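One possible direction, offered as a sketch under assumptions rather than a verified fix: the log shows the receiver stopping as soon as the find query completes ("publisher finished"), so a receiver that is meant to stay up would have to keep its thread alive, query again, and emit only the documents it has not stored yet. The block below illustrates just that bookkeeping in plain Scala, with no Spark or MongoDB dependency. `Doc`, `IncrementalPoller`, `fetchAfter` and `emit` are hypothetical names, not part of Spark or the MongoDB driver, and the sketch assumes every document carries a monotonically increasing `id`.

```scala
// Hypothetical sketch: none of these names come from Spark or the MongoDB driver.
// Assumption: every document has a monotonically increasing id.
final case class Doc(id: Long, payload: String)

// Remembers the highest id already emitted, so that running the same query
// again never emits a document twice. Intended to be used from a single thread.
final class IncrementalPoller(fetchAfter: Long => Seq[Doc], emit: Doc => Unit) {

  // Highest id emitted so far; -1 means nothing has been emitted yet.
  private var lastSeenId: Long = -1L

  // Runs one polling pass and returns the number of newly emitted documents.
  def pollOnce(): Int = {
    // Filter defensively in case fetchAfter returns documents that were already seen.
    val fresh = fetchAfter(lastSeenId).filter(_.id > lastSeenId).sortBy(_.id)
    fresh.foreach { doc =>
      emit(doc)
      lastSeenId = doc.id
    }
    fresh.size
  }
}
```

Inside the receiver, `emit` could be the receiver's `store` and `fetchAfter` a query filtered on the last seen id, with the thread started in `onStart` calling `pollOnce()` in a loop while `isStopped()` is false, sleeping between passes. The position is kept in memory only, so it would be lost if Spark relaunches the receiver.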

0 answers:

No answers