Flink custom SourceFunction

Asked: 2017-09-15 07:57:11

Tags: scala apache-flink

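I'd like to create a SourceFunction that reads an HTTP stream. I use ScalaJ, which does what I want (it splits the incoming text on \n). The code works outside of Flink, but every time I start it as a Flink job I get a NullPointerException (sometimes immediately, sometimes 1-2 seconds in, after 1-2 elements have been emitted). It looks like something is wrong with the Http object.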

import org.apache.flink.streaming.api.functions.source.SourceFunction

import scala.io.Source.fromInputStream
import scalaj.http._

class HttpSource(url: String) extends SourceFunction[String] {

  @volatile var isRunning = true
  override def cancel(): Unit = isRunning = false

  override def run(ctx: SourceFunction.SourceContext[String]): Unit =
    httpStream(ctx.collect)

  private def httpStream(f: String => Unit) = {
    val request = Http(url)
    request
      .execute { inputStream =>
        fromInputStream(inputStream)
          .getLines()
          .takeWhile(_ => isRunning)
          .foreach(f)
      }
  }
}

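This is the exception I usually get (it is sometimes slightly different; for example, when I tried making the request value transient, it was already null by the time the code referenced it):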

Caused by: java.lang.NullPointerException
    at java.io.Reader.<init>(Reader.java:78)
    at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
    at scala.io.BufferedSource.reader(BufferedSource.scala:24)
    at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:25)
    at scala.io.BufferedSource.scala$io$BufferedSource$$charReader$lzycompute(BufferedSource.scala:35)
    at scala.io.BufferedSource.scala$io$BufferedSource$$charReader(BufferedSource.scala:33)
    at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:62)
    at scala.io.BufferedSource$BufferedLineIterator.<init>(BufferedSource.scala:67)
    at scala.io.BufferedSource.getLines(BufferedSource.scala:86)
    at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:21)
    at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:19)
    at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
    at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
    at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:388)
    at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:380)
    at scala.Option.getOrElse(Option.scala:121)
    at scalaj.http.HttpRequest.toResponse(Http.scala:380)
    at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:360)
    at scalaj.http.HttpRequest.exec(Http.scala:335)
    at scalaj.http.HttpRequest.execute(Http.scala:323)
    at flinkextension.HttpSource.httpStream(HttpSource.scala:19)
    at flinkextension.HttpSource.run(HttpSource.scala:14)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
    at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
    at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
    at java.lang.Thread.run(Thread.java:748)

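Everything seems to work when I use something other than the streaming HTTP request: reading a file (which goes through the same InputStream type), a simple while loop emitting strings, or even a single HTTP request without streaming.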

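I feel I'm missing some theoretical background. Maybe Flink does something in the background that breaks the Http object or the InputStream, but I couldn't find anything in the documentation.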
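One piece of background that does apply to the transient variant mentioned earlier: SourceFunction is Serializable, and Flink hands the function instance to the task in serialized form. A plain @transient val is skipped during serialization and comes back as null, while a @transient lazy val is rebuilt on first access. The sketch below only demonstrates that JVM/Scala behavior with a plain Java-serialization round trip; Holder and TransientDemo are hypothetical names, and nothing here is Flink or ScalaJ code. It does not explain the NullPointerException in the stack trace above, where the null value is the InputStream passed into the lambda.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStream}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a source object that holds a non-serializable resource.
class Holder(val url: String) extends Serializable {
  // Skipped on serialization, so this field is null in the deserialized copy.
  @transient val eager: InputStream =
    new ByteArrayInputStream(url.getBytes("UTF-8"))

  // Also skipped, but initialized again on first access in the deserialized copy.
  @transient lazy val onDemand: InputStream =
    new ByteArrayInputStream(url.getBytes("UTF-8"))
}

object TransientDemo {
  // Serialize to a byte array and read the object back, as a rough model of
  // shipping a function instance to another JVM.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

After `val copy = TransientDemo.roundTrip(new Holder(url))`, `copy.eager` is null and `copy.onDemand` is a fresh stream. Creating the request inside run(), as the code in the question already does, avoids the issue without needing @transient at all.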

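Update #1:

If I add a null check inside the lambda, the job usually exits immediately; sometimes it processes a few elements, and sometimes it hangs for a minute and then times out. This is that version of the httpStream function: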

  private def httpStream(f: String => Unit) = {
    val request = Http(url)
    request
      .execute { inputStream =>
        if (inputStream == null) println("null inputstream")
        else {
          println("not null inputstream")
          fromInputStream(inputStream)
            .getLines()
            .takeWhile(_ => isRunning)
            .foreach(f)
        }
      }
  }
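A note on the "exits immediately" behavior: when run() returns normally, Flink treats the source as finished, so printing a message on a null stream and returning ends the job quietly. A minimal sketch of an alternative is to fail loudly instead, so the task fails with a message that names the real problem. LinePump and its parameter names are hypothetical; the helper uses only the standard library so it can be exercised without Flink or ScalaJ.

```scala
import java.io.InputStream
import scala.io.{Codec, Source}

object LinePump {
  // Reads lines from `in` and hands each one to `emit` until the stream ends
  // or `keepGoing()` returns false. Returns the number of lines emitted.
  // Throws IllegalArgumentException if the stream is null.
  def pump(in: InputStream, keepGoing: () => Boolean, emit: String => Unit): Long = {
    require(in != null, "HTTP client passed a null InputStream to the parser")
    var count = 0L
    Source.fromInputStream(in)(Codec.UTF8)
      .getLines()
      .takeWhile(_ => keepGoing())
      .foreach { line =>
        emit(line)
        count += 1
      }
    count
  }
}
```

In the httpStream function above, the lambda body would then shrink to something like `request.execute(in => LinePump.pump(in, () => isRunning, f))`. This makes the failure visible; it does not by itself explain why the stream is null.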

Update #2:

The code actually works in distributed mode, and also with StreamExecutionEnvironment.createLocalEnvironment().

I only run into the problem when I use start-local.sh and submit the jar to it.
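Since the failure depends on how the job is launched, one thing that could be compared between the two setups is where the HTTP client classes are loaded from. This is only a diagnostic sketch under the assumption that the difference is classpath related, which nothing in this post confirms. ClassOrigin is a hypothetical helper built on standard JVM reflection calls.

```scala
object ClassOrigin {
  // Jar or directory a class was loaded from; None for JDK bootstrap classes.
  def locationOf(cls: Class[_]): Option[String] =
    for {
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString

  // One-line summary: class name, origin, and the class loader that defined it.
  def describe(cls: Class[_]): String = {
    val origin = locationOf(cls).getOrElse("bootstrap/unknown")
    val loader = Option(cls.getClassLoader).map(_.toString).getOrElse("bootstrap loader")
    cls.getName + " <- " + origin + " via " + loader
  }
}
```

Logging `ClassOrigin.describe(request.getClass)` at the start of run() in both setups would show whether the submitted job picks the ScalaJ classes up from the same jar as the local environment does.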

0 answers:

No answers