Question

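I have an RDD with many columns (e.g., hundreds), and most of my operations are column-wise; for example, I need to create many intermediate variables from different columns.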

What is the most efficient way to do this?

For example, if my dataRDD[Array[String]] looks like this:

123, 523, 534, ..., 893 
536, 98, 1623, ..., 98472 
537, 89, 83640, ..., 9265 
7297, 98364, 9, ..., 735 
...... 
29, 94, 956, ..., 758

I need to create a new column or variable newCol1 = 2ndCol + 19thCol, and another new column based on newCol1 and an existing column: newCol2 = function(newCol1, 34thCol).

What is the best way to do this?

I have been considering keying both the intermediate variables and dataRDD by index, then joining them on the index to do the computation:

var dataRDD = sc.textFile("/test.csv").map(_.split(","))
val dt = dataRDD.zipWithIndex.map(_.swap)
val newCol1 = dataRDD.map(x => x(1)+x(18)).zipWithIndex.map(_.swap)
val newCol2 = newCol1.join(dt).map(x=> function(.........))

Is there a better way?

Answer 1

Why not just do it all in a single map:

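val dataRDD = sc.textFile("/test.csv").map(_.split(","))
dataRDD.map(x => {
  // Columns are Strings: trim and convert first, otherwise + concatenates instead of adding
  val newCol = x(1).trim.toDouble + x(18).trim.toDouble
  val newCol2 = function(newCol, x(33)) // function is the placeholder from the question
  // anything else you need to do
  newCol.toString +: newCol2.toString +: x // returns the original array with the new columns prepended
  // x :+ newCol.toString :+ newCol2.toString // alternatively, returns the original array with the new columns appended
})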
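
To try the per-row logic without a cluster, here is a minimal sketch that pulls it out into a plain function. The names AddColumns, combine and addColumns are hypothetical, and combine (a product) is only an assumed stand-in for the question's unspecified function; the column positions are the only part taken from the question.

```scala
object AddColumns {
  // Assumed stand-in for the question's unspecified `function`; a product is used for illustration only.
  def combine(a: Double, b: Double): Double = a * b

  // Appends newCol1 (2nd + 19th column) and newCol2 (combine(newCol1, 34th column)) to a row.
  def addColumns(row: Array[String]): Array[String] = {
    // Fields are Strings and may carry spaces after the comma, so trim before converting.
    val col = (i: Int) => row(i).trim.toDouble
    val newCol1 = col(1) + col(18)
    val newCol2 = combine(newCol1, col(33))
    row :+ newCol1.toString :+ newCol2.toString
  }

  def main(args: Array[String]): Unit = {
    // Local stand-in for one parsed line: 34 columns holding the values 1..34.
    val row = (1 to 34).map(_.toString).toArray
    val out = addColumns(row)
    println(out.takeRight(2).mkString(", ")) // prints "21.0, 714.0"
  }
}
```

With Spark this would be applied as dataRDD.map(AddColumns.addColumns). Every derived column is computed in one pass over each row, so no zipWithIndex or join (and no shuffle) is needed.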

Column operations on Spark RDD
