I have a Spark RDD[String] that I want to stream into the stdin of an external command running on the local machine. The setup would be something like this:
val data: RDD[String] = <Valid data>
val process = Seq("wc", "-l") // Not the actual process, but it behaves the same way: it consumes a whole bunch of lines and produces very little output itself
// Here's what I've tried so far
val exitCode = (process #< data.toLocalIterator.toStream) ! // Doesn't work
val exitCode = (process #< new ByteArrayInputStream(data.toLocalIterator.mkString("\n").getBytes("UTF-8"))) ! // Works, but seems to load all the data into local memory, which is definitely not what I want since the data could be very big
val processIO = new ProcessIO(
in => data.toLocalIterator.toStream,
out => scala.io.Source.fromInputStream(out).getLines.foreach(println),
err => scala.io.Source.fromInputStream(err).getLines.foreach(println))
val exitCode = process.run(processIO) // This also doesn't work
Can anyone point me to a working solution that doesn't load all the data onto the local machine, but instead streams it straight from the RDD[String] into the process, the way
cat data.txt | wc -l
does on the command line?
Thanks
Answer 0 (score: 0)
I think I've figured it out. It seems I simply forgot to actually write anything to the process's input stream. Here's the code, which works for my small test. I haven't tried it on big data yet, but it looks like it should work.
import scala.sys.process.BasicIO
import scala.util.Properties
import com.google.common.base.Charsets // Guava; java.nio.charset.StandardCharsets works just as well

val processIO = BasicIO.standard(in => {
  data.toLocalIterator.foreach(x => in.write((x + Properties.lineSeparator).getBytes(Charsets.UTF_8)))
  in.close()
})
val exitCode = process.run(processIO).exitValue
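The approach above can be checked without Spark at all: any `Iterator[String]` can be piped into a command the same way, since `toLocalIterator` is just an iterator pulled to the driver. Here is a minimal self-contained sketch; `pipeLines` is a hypothetical helper name, not part of any library.

```scala
import scala.sys.process.{BasicIO, Process}
import scala.util.Properties

// Lazily write each line to the command's stdin, close the stream so the
// command sees EOF, then block until it finishes and return its exit code.
def pipeLines(lines: Iterator[String], command: Seq[String]): Int = {
  val io = BasicIO.standard { in =>
    lines.foreach(line => in.write((line + Properties.lineSeparator).getBytes("UTF-8")))
    in.close()
  }
  Process(command).run(io).exitValue()
}

val exitCode = pipeLines(Iterator("a", "b", "c"), Seq("wc", "-l"))
```

Because the iterator is consumed one element at a time, only the line currently being written is held in memory on the local machine.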
Answer 1 (score: 0)
This isn't an answer, but you should know it won't behave like cat data.txt | wc -l, because an RDD can (and usually will) be split across multiple processes (tasks running in executors). Your receiving program therefore needs to be able to handle multiple streams, and you should know that the data will not arrive in any particular order.
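A toy illustration of that point, in plain Scala with no Spark: if each partition feeds its own invocation of the command, the receiver runs once per partition and the per-partition results have to be combined afterwards. The partition contents here are made up for the example.

```scala
import java.io.ByteArrayInputStream
import scala.sys.process._

// Simulate an RDD split into three partitions of lines.
val partitions = Seq(Seq("a", "b"), Seq("c"), Seq("d", "e", "f"))

// Each "partition" pipes its lines into a separate `wc -l` invocation,
// so we get one count per partition, not a single global count.
val perPartitionCounts = partitions.map { part =>
  val input = new ByteArrayInputStream(part.map(_ + "\n").mkString.getBytes("UTF-8"))
  (Seq("wc", "-l") #< input).!!.trim.toInt
}

val total = perPartitionCounts.sum // combining the partial results is up to you
```

With `cat data.txt | wc -l` there is exactly one stream and one invocation; with a distributed RDD, aggregation across invocations becomes the caller's job.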