Question

I have written a Spark Streaming program in Scala. In it, I pass the "remote-host" and "remote port" to socketTextStream.

On the remote machine, I have a Perl script that invokes a system command:

echo 'data_str' | nc <remote_host> <9999>

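This way my Spark program is able to receive the data, but since I have multiple remote machines that need to send data to the Spark machine, it seems a bit messy. I would like to know the right way to do this. In particular, how should I handle data coming from multiple hosts?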

For reference, my current program:

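import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The enclosing object was not shown in the post; its name here is assumed.
object HBaseStream {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HBaseStream")
    val sc = new SparkContext(conf)

    // Process the stream in 2-second batches
    val ssc = new StreamingContext(sc, Seconds(2))

    // Read lines of text from the socket at <remote-host>:9999
    val inputStream = ssc.socketTextStream("<remote-host>", 9999)
    // ... (processing omitted in the original post)

    ssc.start()
    // Wait for the computation to terminate
    ssc.awaitTermination()
  }
}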

Thanks in advance.

Answer 1

You can find more information in the "Level of Parallelism in Data Receiving" section of the Spark Streaming Programming Guide.

Summary:

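Receiving multiple data streams can therefore be achieved by creating multiple input DStreams and configuring them to receive different partitions of the data stream from the source(s).
These multiple DStreams can be unioned together to create a single DStream. The transformations that were being applied to a single input DStream can then be applied to the unified stream.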
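
As a rough illustration of the summary above, here is a minimal sketch that creates one socket input DStream per remote host and unions them. The object name MultiHostStream, the parseEndpoint helper, and the "host:port" argument format are assumptions made for this example and are not part of the original question. Note that socketTextStream connects to the given host and port as a client, so the sketch assumes every remote machine is listening on its port.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object MultiHostStream {
  // Hypothetical helper: split a "host:port" string into (host, port).
  // Throws if the colon is missing or the port is not a number.
  def parseEndpoint(s: String): (String, Int) = {
    val Array(host, port) = s.split(":", 2)
    (host, port.toInt)
  }

  def main(args: Array[String]): Unit = {
    // Assumed invocation: MultiHostStream hostA:9999 hostB:9999 ...
    val endpoints = args.map(parseEndpoint).toSeq
    require(endpoints.nonEmpty, "usage: MultiHostStream host:port [host:port ...]")

    val conf = new SparkConf().setAppName("MultiHostStream")
    val ssc = new StreamingContext(conf, Seconds(2))

    // One receiver-based input DStream per remote endpoint
    val streams: Seq[DStream[String]] = endpoints.map { case (host, port) =>
      ssc.socketTextStream(host, port)
    }

    // Merge all input DStreams into a single DStream
    val unified: DStream[String] = ssc.union(streams)

    // Transformations written for a single stream now apply to the unified one
    unified.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each receiver occupies one core for as long as the application runs, so the application should be given more cores than it has receivers; otherwise nothing is left to process the received data.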
