Selecting multiple arbitrary columns from a Scala array using map()

Asked: 2015-07-09 10:10:56

Tags: scala csv apache-spark

I am new to Scala (and Spark). I am trying to read a CSV file and extract several arbitrary columns from the data. The following function does this, but with hard-coded column indices:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readCSV(filename: String, sc: SparkContext): RDD[String] = {
  // Split each line on commas, then keep only columns 2, 4 and 15
  val input = sc.textFile(filename).map(line => line.split(","))
  val out = input.map(csv => csv(2) + "," + csv(4) + "," + csv(15))
  out
}

Is there a way to use map with an arbitrary number of column indices passed in for the array lookup?

1 Answer:

Answer 0 (score: 2):

If you have a sequence of indices, you can map over it and return the corresponding values:

scala> val m = List(List(1,2,3), List(4,5,6))
m: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6))

scala> val indices = List(0,2)
indices: List[Int] = List(0, 2)

// For each inner sequence, get the relevant values
// indices.map(inner) is the same as indices.map(i => inner(i))
scala> m.map(inner => indices.map(inner))
res1: List[List[Int]] = List(List(1, 3), List(4, 6))

// If you want to join all of them use .mkString
scala> m.map(inner => indices.map(inner).mkString(","))
res2: List[String] = List(1,3, 4,6)  // note: this is actually a List containing 2 Strings
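
The same idea carries over to the original readCSV function: pass the indices in as a parameter and let the inner map pick them out of each split line. Here is a minimal sketch along those lines; the extra indices parameter and the generalized signature are my own, not from the original post:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical generalization of readCSV: the caller supplies the column indices.
def readCSV(filename: String, sc: SparkContext, indices: Seq[Int]): RDD[String] = {
  sc.textFile(filename)
    .map(line => line.split(","))                          // split each line into columns
    .map(cols => indices.map(i => cols(i)).mkString(","))  // keep only the requested columns
}

For example, readCSV("data.csv", sc, Seq(2, 4, 15)) would reproduce the hard-coded behaviour from the question.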