Spark Scala: receiving UDP on a listening port

Time: 2016-12-25 10:48:38

Tags: scala sockets apache-spark udp spark-streaming

The example in the Spark Streaming programming guide lets me receive packets over a TCP stream by listening on port 9999.

I can send data over TCP by creating a data server on my Linux system with:

$ nc -lk 9999

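Question
I need to receive a stream from an Android phone that sends over UDP, but the Scala/Spark call

val lines = ssc.socketTextStream("localhost", 9999)

receives TCP streams only.

How can I receive a UDP stream and create a Spark DStream from it in a similarly simple way, using Scala and Spark?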

2 answers:

Answer 0 (score: 3)

There is nothing built in, but it is not too much work to do it yourself. Here is a simple solution I made, based on a custom UdpSocketInputDStream[T]:

import java.io._
import java.net.{DatagramPacket, DatagramSocket, InetAddress, SocketException}

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.receiver.Receiver

import scala.reflect.ClassTag
import scala.util.control.NonFatal

class UdpSocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
) extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    new UdpSocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}

class UdpSocketReceiver[T: ClassTag](host: String,
                                     port: Int,
                                     bytesToObjects: InputStream => Iterator[T],
                                     storageLevel: StorageLevel) extends Receiver[T](storageLevel) {

  var udpSocket: DatagramSocket = _

  override def onStart(): Unit = {

    try {
      // Bind to the local address and port that datagrams will be sent to
      udpSocket = new DatagramSocket(port, InetAddress.getByName(host))
    } catch {
      // Binding fails with a SocketException (e.g. BindException), not a ConnectException
      case e: SocketException =>
        restart(s"Error binding UDP socket to $host:$port", e)
        return
    }

    // Start the thread that receives datagrams from the socket
    new Thread("Udp Socket Receiver") {
      setDaemon(true)

      override def run(): Unit = {
        receive()
      }
    }.start()
  }

  /** Receive one datagram, store the objects decoded from it, then ask Spark to restart the receiver */
  def receive(): Unit = {
    try {
      val buffer = new Array[Byte](2048)

      // Create a packet to receive data into the buffer
      val packet = new DatagramPacket(buffer, buffer.length)

      // Block until a datagram arrives
      udpSocket.receive(packet)

      val iterator = bytesToObjects(new ByteArrayInputStream(packet.getData, packet.getOffset, packet.getLength))
      // Store every object decoded from this datagram
      while (!isStopped() && iterator.hasNext) {
        store(iterator.next())
      }

      if (!isStopped()) {
        restart("Udp socket data stream had no more data")
      }
    } catch {
      case NonFatal(e) =>
        restart("Error receiving data", e)
    } finally {
      onStop()
    }
  }

  override def onStop(): Unit = {
    synchronized {
      if (udpSocket != null) {
        udpSocket.close()
        udpSocket = null
      }
    }
  }
}

To make this available as a method on StreamingContext itself, we enrich it with an implicit class:

// Needed for the return type below
import org.apache.spark.streaming.dstream.InputDStream

object Implicits {
  implicit class StreamingContextOps(val ssc: StreamingContext) extends AnyVal {
    def udpSocketStream[T: ClassTag](host: String,
                                     port: Int,
                                     converter: InputStream => Iterator[T],
                                     storageLevel: StorageLevel): InputDStream[T] = {
      new UdpSocketInputDStream(ssc, host, port, converter, storageLevel)
    }
  }
}

Here is how you call it:

import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.StandardCharsets

import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.reflect.ClassTag

object TestRunner {
  import Implicits._

  def main(args: Array[String]): Unit = {
    val sparkContext = new SparkContext("local[*]", "udpTest")
    val ssc = new StreamingContext(sparkContext, Seconds(4))

    val stream = ssc.udpSocketStream("localhost", 
                                     3003, 
                                     bytesToLines, 
                                     StorageLevel.MEMORY_AND_DISK_SER_2)
    stream.print()

    ssc.start()
    ssc.awaitTermination()
  }

  def bytesToLines(inputStream: InputStream): Iterator[String] = {
    val dataInputStream = new BufferedReader(
      new InputStreamReader(inputStream, StandardCharsets.UTF_8))
    new NextIterator[String] {
      protected override def getNext(): String = {
        val nextValue = dataInputStream.readLine()
        if (nextValue == null) {
          finished = true
        }
        nextValue
      }

      protected override def close(): Unit = {
        dataInputStream.close()
      }
    }
  }

  abstract class NextIterator[U] extends Iterator[U] {
    protected var finished = false
    private var gotNext = false
    private var nextValue: U = _
    private var closed = false

    override def next(): U = {
      if (!hasNext) {
        throw new NoSuchElementException("End of stream")
      }
      gotNext = false
      nextValue
    }

    override def hasNext: Boolean = {
      if (!finished) {
        if (!gotNext) {
          nextValue = getNext()
          if (finished) {
            closeIfNeeded()
          }
          gotNext = true
        }
      }
      !finished
    }

    def closeIfNeeded(): Unit = {
      if (!closed) {
        closed = true
        close()
      }
    }

    protected def getNext(): U
    protected def close(): Unit
  }
}

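Most of this code is taken from the SocketInputDStream[T] that Spark provides; I simply reused it. I also took the NextIterator code used by bytesToLines, which just consumes the lines in the packet and converts them to String. If you have more complex logic, you can supply it by passing your own implementation as converter: InputStream => Iterator[T].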
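To illustrate that last point, here is a minimal sketch of a custom converter. It assumes, purely for the sake of the example, that each datagram carries a sequence of 4-byte big-endian integers. The names BinaryConverters and bytesToInts are hypothetical and are not part of Spark or of the code above.

```scala
import java.io.{DataInputStream, EOFException, InputStream}

object BinaryConverters {
  // Hypothetical converter: decodes the stream as consecutive 4-byte
  // big-endian Ints. Trailing bytes that do not form a whole Int are dropped.
  def bytesToInts(inputStream: InputStream): Iterator[Int] = {
    val in = new DataInputStream(inputStream)

    new Iterator[Int] {
      // Look one element ahead so hasNext can answer without side effects.
      private var lookahead: Option[Int] = readNext()

      private def readNext(): Option[Int] =
        try Some(in.readInt())
        catch {
          case _: EOFException =>
            in.close()
            None
        }

      override def hasNext: Boolean = lookahead.isDefined

      override def next(): Int = {
        val value = lookahead.getOrElse(throw new NoSuchElementException("End of stream"))
        lookahead = readNext()
        value
      }
    }
  }
}
```

Passing BinaryConverters.bytesToInts in place of bytesToLines should then give a stream of Int rather than String, since the element type follows from the converter.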

Testing it with a simple UDP packet:

echo -n "hello hello hello!" >/dev/udp/localhost/3003

Yields:

-------------------------------------------
Time: 1482676728000 ms
-------------------------------------------
hello hello hello!
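The /dev/udp path is a bash feature and is not available in every shell. Where it is missing, the same test packet can be sent from the JVM. The following is a minimal sketch using only java.net; the names UdpTestSender and send are made up for this example.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import java.nio.charset.StandardCharsets

object UdpTestSender {
  // Sends `text` as a single UTF-8 encoded datagram to host:port.
  def send(host: String, port: Int, text: String): Unit = {
    val payload = text.getBytes(StandardCharsets.UTF_8)
    val socket = new DatagramSocket() // bound to any free local port
    try {
      socket.send(new DatagramPacket(payload, payload.length, InetAddress.getByName(host), port))
    } finally {
      socket.close()
    }
  }

  // Same packet as the shell one-liner above
  def main(args: Array[String]): Unit =
    send("localhost", 3003, "hello hello hello!")
}
```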

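Of course, this needs further testing. There is also a hidden assumption: the buffer behind each DatagramPacket is 2048 bytes, so anything longer in a single datagram is cut off. That is something you may want to change.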
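The effect of that fixed buffer can be checked outside Spark. The sketch below uses only java.net and hypothetical names (TruncationDemo, receivedLength). It sends a datagram over loopback to a socket that receives into a buffer of a given size and reports how many bytes arrived. When the payload is larger than the buffer, DatagramSocket.receive silently discards the excess.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}

object TruncationDemo {
  // Sends `payloadSize` bytes over loopback and returns how many bytes the
  // receiver sees when it uses a buffer of `bufferSize` bytes.
  def receivedLength(payloadSize: Int, bufferSize: Int): Int = {
    val loopback = InetAddress.getLoopbackAddress
    val receiver = new DatagramSocket(0, loopback) // port 0 = any free port
    val sender = new DatagramSocket()
    try {
      receiver.setSoTimeout(5000) // fail instead of blocking forever
      val payload = new Array[Byte](payloadSize)
      sender.send(new DatagramPacket(payload, payload.length, loopback, receiver.getLocalPort))

      val buffer = new Array[Byte](bufferSize)
      val packet = new DatagramPacket(buffer, buffer.length)
      receiver.receive(packet)
      packet.getLength
    } finally {
      sender.close()
      receiver.close()
    }
  }

  def main(args: Array[String]): Unit = {
    println(receivedLength(payloadSize = 100, bufferSize = 2048))
    println(receivedLength(payloadSize = 3000, bufferSize = 2048))
  }
}
```

The second call should report 2048 rather than 3000, so a receiver that expects larger datagrams needs a correspondingly larger buffer.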

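Answer 1 (score: 0)

The problem with Yuval Itzchakov's solution is that the receiver receives one message and then restarts. Simply replace the restart with a call to receive, as shown below.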

  def receive(): Unit = {
    try {
      val buffer = new Array[Byte](200000)

      // Create a packet to receive data into the buffer
      val packet = new DatagramPacket(buffer, buffer.length)
      udpSocket.receive(packet)

      val iterator = bytesToObjects(new ByteArrayInputStream(packet.getData, packet.getOffset, packet.getLength))
      // Store every object decoded from this datagram
      while (!isStopped() && iterator.hasNext) {
        store(iterator.next())
      }

      if (!isStopped()) {
        // restart("Udp socket data stream had no more data")
        receive() // wait for the next datagram instead of restarting the receiver
      }
    } catch {
      case NonFatal(e) =>
        restart("Error receiving data", e)
    } finally {
      onStop()
    }
  }
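
Note that calling receive() from inside itself adds a stack frame for every datagram, so a long-running receiver may eventually run out of stack. A plain loop avoids that. The following is only a sketch, written against java.net so that it runs without Spark. PacketLoop and receiveLoop are hypothetical names, and the isStopped and store parameters stand in for the Receiver methods of the same names.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import java.net.{DatagramPacket, DatagramSocket}
import java.nio.charset.StandardCharsets

object PacketLoop {
  // Receives datagrams until `isStopped()` returns true, passing every line
  // of every datagram to `store`. Both function parameters are stand-ins for
  // Receiver.isStopped() and Receiver.store(...).
  def receiveLoop(socket: DatagramSocket,
                  bufferSize: Int,
                  isStopped: () => Boolean,
                  store: String => Unit): Unit = {
    val buffer = new Array[Byte](bufferSize)
    while (!isStopped()) {
      // A fresh packet per iteration resets the length to the full buffer size.
      val packet = new DatagramPacket(buffer, buffer.length)
      socket.receive(packet) // blocks until a datagram arrives
      val reader = new BufferedReader(new InputStreamReader(
        new ByteArrayInputStream(packet.getData, packet.getOffset, packet.getLength),
        StandardCharsets.UTF_8))
      // Hand each line of this datagram to the caller
      Iterator.continually(reader.readLine()).takeWhile(_ != null).foreach(store)
    }
  }
}
```

Inside UdpSocketReceiver the same shape would call isStopped() and store(...) directly, with the existing catch block still handling errors such as the socket being closed by onStop().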