Question

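I have a Spark RDD in which each element is a tuple of the form (key, input). I would like to use the pipe method to pass the inputs to an external executable and produce a new RDD of the form (key, output). I need the keys later for correlation.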

Here is an example using spark-shell:

val data = sc.parallelize(
  Seq(
    ("file1", "one"),
    ("file2", "two two"),
    ("file3", "three three three")))

// Incorrectly processes the data (calls toString() on each tuple)
data.pipe("wc")

// Loses the keys, generates extraneous results
data.map( elem => elem._2 ).pipe("wc")

Thanks in advance.

Answer 1

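The solution with map is not correct: map is not guaranteed to preserve partitioning, so a subsequent zip will fail. You need to use mapValues to preserve the partitioning of the initial RDD.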

data.zip( 
  data.mapValues{ _.toString }.pipe("my_executable")
).map { case ((key, input), output) => 
  (key, output)
}

Answer 2

Assuming you cannot pass the key through the executable's input and output, this might work:

rdd
  .map(x => x._1)
  .zip(rdd
          .map(x => x._2)
          .pipe("my executable"))

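Note that this can be fragile, and it will definitely break if your executable does not produce exactly one line of output for each input record.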
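
If the one-line-per-record requirement is a problem, one possible workaround is to skip pipe and zip entirely and start the executable once per record inside mapValues, so each output stays attached to its key by construction. This is a different technique from the ones in the answers, not something pipe does for you, and it pays the cost of one process launch per record. The sketch below uses scala.sys.process from the standard library; the helper name runOnce is hypothetical, and it assumes the command reads stdin and writes its result to stdout.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import scala.sys.process._

// Hypothetical helper: feeds `input` (plus a trailing newline) to the
// command's stdin and returns everything it wrote to stdout, trimmed.
// `!!` throws a RuntimeException if the command exits with a non-zero code.
def runOnce(command: Seq[String], input: String): String = {
  val stdin = new ByteArrayInputStream(
    (input + "\n").getBytes(StandardCharsets.UTF_8))
  (command #< stdin).!!.trim
}

// In spark-shell, with `data` defined as in the question:
//   val result = data.mapValues(input => runOnce(Seq("wc"), input))
//   // result is an RDD[(String, String)] with the original keys
```

Because the whole stdout of each run is captured as a single string, multi-line output no longer breaks the pairing, but the executable must be installed on every worker node, just as with pipe.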

Piping values from tuples in a Spark RDD
