Question

I have successfully used RDD.pipe to pass an RDD to a Python program one row at a time.

The problem is that the Python program cannot know how to interpret each row unless it has some information about the schema.

Is there a way to pass the RDD's schema information to the external program?

I was hoping that the version of pipe that accepts printPipeContext and printRDDElement might be useful, but I cannot tell from the ScalaDoc or the source what their purpose is.

Can they be used to pass schema information to the external program? If so, how? If not, what are they useful for? A simple example would help.

Here is my current (passing) test case:

test("Pipe to python (with metadata)") {

    val dataRdd: RDD[Row] = SPARK_SESSION.sparkContext.makeRDD(
        Seq(
            Row("first", 2.0, 7.0, "A"),
            Row("second", 3.5, 2.5, "B"),
            Row("third", 7.0, 5.9, "B"),
            Row("fourth", 27.0, 15.8, "A")
        )
    )

    val scriptPath = HOME + "/src/test/resources/python/process.py"

    def printPipeContext(prnt: String => Unit): Unit = {
        // what could go here to convey schema?
    }

    def printRDDElement(record:Row, f: String => Unit): Unit = {
        // what could go here to convey schema?
    }

    val resultRdd = dataRdd.pipe(Seq(scriptPath), Map("SEPARATOR" -> ","), 
                                 printPipeContext = null, printRDDElement = null) 

    val result: Array[String] = resultRdd.collect()
    assertResult(
      strip("""hello first ,  14.0
              |hello second ,  8.75
              |hello third ,  41.3
              |hello fourth ,  426.6""")) {
      result.mkString("\n")
    }
}

Here is the external Python program.

It currently assumes that columns 2 and 3 are floats and multiplies them together.

#!/usr/bin/python

import sys
import os

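for line in sys.stdin:
    # Each line is Row.toString, e.g. "[first,2.0,7.0,A]\n"; drop "[" and the trailing "]\n".
    values = line[1:-2].split(os.environ['SEPARATOR'])
    # Python 2 print statement: the commas separate the parts with single spaces.
    print "hello " + values[0], ", ", (float(values[1]) * float(values[2]))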

But what if I want to multiply all of the Float columns together? How can the program tell which columns are of type Float?
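To make the goal concrete, here is a rough sketch of what the Python side might look like if the schema arrived as the first line on stdin. This assumes that printPipeContext is invoked once per piped process before any elements are written, so that it could emit such a header; I have not confirmed that this is its intended use. The header format (`name:type` pairs) and the function names `parse_schema` and `multiply_float_columns` are made up for illustration.

```python
def parse_schema(header, sep=","):
    """Parse a hypothetical header such as 'name:string,x:double' into (name, type) pairs."""
    return [tuple(field.split(":", 1)) for field in header.strip().split(sep)]


def multiply_float_columns(lines, sep=","):
    """Treat the first line as the schema header, then multiply all float columns per row."""
    lines = iter(lines)
    schema = parse_schema(next(lines), sep)
    # Indices of the columns whose declared type is double or float.
    float_idx = [i for i, (_, typ) in enumerate(schema) if typ in ("double", "float")]
    results = []
    for line in lines:
        # Row.toString renders as "[first,2.0,7.0,A]"; drop the newline and the brackets.
        values = line.strip()[1:-1].split(sep)
        product = 1.0
        for i in float_idx:
            product *= float(values[i])
        results.append((values[0], product))
    return results
```

In the real script the function would be fed `sys.stdin` instead of a list. If the process really is started once per partition, each process would see its own copy of the header line, which is what this sketch relies on.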

How can schema information be passed when processing an RDD with an external program via pipe?
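One workaround I can imagine, since the test already passes SEPARATOR through the environment map, is to add a second variable that lists the column types. The sketch below shows only the Python side. The variable name `SCHEMA`, its comma-separated format, and the helper `float_column_indices` are assumptions for illustration, not anything provided by Spark.

```python
import os


def float_column_indices(environ=None):
    """Return the indices of columns declared as double or float.

    SCHEMA is a hypothetical environment variable, e.g. 'string,double,double,string'.
    """
    environ = os.environ if environ is None else environ
    types = environ.get("SCHEMA", "").split(",")
    return [i for i, t in enumerate(types) if t.strip() in ("double", "float")]
```

The environment mapping is a parameter only so the function can be tried without touching the real process environment. This would cover a fixed schema, but it does not answer what printPipeContext and printRDDElement are for.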

0 answers: