Skip CSV headers when reading multiple files into an RDD in Scala

Time: 2016-10-17 15:49:01

Tags: scala csv apache-spark header rdd

I am trying to read multiple CSVs from a path into an RDD. The path contains many CSVs. Is there a way to skip the headers when reading all the CSVs into the RDD, so that spotsRDD omits the headers without having to filter each CSV or handle each one separately and then union them?

val path = "file:///home/work/csvs/*"
val spotsRDD = sc.textFile(path)
println(spotsRDD.count())

Thanks

1 Answer:

Answer 0: (score: 1)

It's a pity you are using Spark 1.0.0.

You could use the CSV Data Source for Apache Spark, but that library requires Spark 1.3+. By the way, the library has been inlined into Spark 2.x.
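For reference, here is a minimal sketch of what that would look like, assuming Spark 1.4+ with the spark-csv package on the classpath (the path is the one from the question; sqlContext is the standard Spark 1.x SQL entry point):

// Hypothetical usage with the spark-csv data source.
// The "header" option tells it to drop the first line of each file.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("file:///home/work/csvs/*")

// On Spark 2.x, CSV support is built in:
// val df = spark.read.option("header", "true").csv("file:///home/work/csvs/*")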

But we can analyze it and implement something similar ourselves.

When we look at com/databricks/spark/csv/DefaultSource.scala, there is

val useHeader = parameters.getOrElse("header", "false")

and then in com/databricks/spark/csv/CsvRelation.scala there is

// If header is set, make sure firstLine is materialized before sending to executors.
val filterLine = if (useHeader) firstLine else null

baseRDD().mapPartitions { iter =>
  // When using header, any input line that equals firstLine is assumed to be header
  val csvIter = if (useHeader) {
    iter.filter(_ != filterLine)
  } else {
    iter
  }
  parseCSV(csvIter, csvFormat)
}

So, if we assume that the first line occurs only once in the RDD (i.e. among our CSV lines), we can do something like the following example:

Sample CSV file (test.csv):

Latitude,Longitude,Name
48.1,0.25,"First point"
49.2,1.1,"Second point"
47.5,0.75,"Third point"

scala> val csvDataRdd = sc.textFile("test.csv")
csvDataRdd: org.apache.spark.rdd.RDD[String] = test.csv MapPartitionsRDD[24] at textFile at <console>:24

scala> val header = csvDataRdd.first
header: String = Latitude,Longitude,Name

scala> val csvDataWithoutHeaderRdd = csvDataRdd.mapPartitions { iter => iter.filter(_ != header) }
csvDataWithoutHeaderRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitions at <console>:28

scala> csvDataWithoutHeaderRdd.foreach(println)
49.2,1.1,"Second point"
48.1,0.25,"First point"
47.5,0.75,"Third point"
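
Applying the same idea back to the multi-file path from the question, a minimal sketch (it assumes every CSV under the path has the identical header line, so a single filter removes the header from each file):

val path = "file:///home/work/csvs/*"
val spotsRDD = sc.textFile(path)

// first() returns the header line of whichever file comes first;
// assuming all files share the same header, filtering on equality
// drops that line from every file in one pass.
val header = spotsRDD.first()
val spotsWithoutHeader = spotsRDD.filter(_ != header)

println(spotsWithoutHeader.count())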