I am trying to read multiple CSVs from a path into an RDD. The path contains many CSVs. Is there a way to skip the headers when reading all the CSVs into the RDD, i.e. have spotsRDD omit the headers without using a filter or handling each CSV separately and then unioning them?
val path ="file:///home/work/csvs/*"
val spotsRDD= sc.textFile(path)
println(spotsRDD.count())
Thanks
Answer 0 (score: 1)
Unfortunately you are using Spark 1.0.0.
You could use the CSV Data Source for Apache Spark, but that library requires Spark 1.3+.
And by the way, this library has been merged into Spark 2.x.
But we can analyze it and implement something similar.
When we look at com/databricks/spark/csv/DefaultSource.scala, there is
val useHeader = parameters.getOrElse("header", "false")
and then in com/databricks/spark/csv/CsvRelation.scala there is
// If header is set, make sure firstLine is materialized before sending to executors.
val filterLine = if (useHeader) firstLine else null

baseRDD().mapPartitions { iter =>
  // When using header, any input line that equals firstLine is assumed to be header
  val csvIter = if (useHeader) {
    iter.filter(_ != filterLine)
  } else {
    iter
  }
  parseCSV(csvIter, csvFormat)
}
So, if we assume that the first line appears only once in the RDD (i.e. among our CSV lines), we can do something like the following example:
Example CSV file:
Latitude,Longitude,Name
48.1,0.25,"First point"
49.2,1.1,"Second point"
47.5,0.75,"Third point"
scala> val csvDataRdd = sc.textFile("test.csv")
csvDataRdd: org.apache.spark.rdd.RDD[String] = test.csv MapPartitionsRDD[24] at textFile at <console>:24
scala> val header = csvDataRdd.first
header: String = Latitude,Longitude,Name
scala> val csvDataWithoutHeaderRdd = csvDataRdd.mapPartitions{iter => iter.filter(_ != header)}
csvDataWithoutHeaderRdd: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[25] at mapPartitions at <console>:28
scala> csvDataWithoutHeaderRdd.foreach(println)
49.2,1.1,"Second point"
48.1,0.25,"First point"
47.5,0.75,"Third point"
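The same trick extends to the original wildcard path, because every CSV is assumed to share an identical header line: materialize the first line once on the driver, then filter it out in every partition. Below is a minimal self-contained sketch of that per-partition filter; `HeaderFilter` and `dropHeader` are my own illustrative names, not part of any Spark API, and the Spark wiring is shown only in comments:

```scala
object HeaderFilter {
  // Mirrors the per-partition logic from CsvRelation: drop every line that
  // equals the (already materialized) header string. Data rows are assumed
  // never to be byte-for-byte identical to the header.
  def dropHeader(header: String)(iter: Iterator[String]): Iterator[String] =
    iter.filter(_ != header)
}

// With Spark this would be used roughly as (sketch, not executed here):
//   val spotsRDD = sc.textFile("file:///home/work/csvs/*")
//   val header   = spotsRDD.first()   // one header line, identical in all files
//   val data     = spotsRDD.mapPartitions(HeaderFilter.dropHeader(header))
```

Because the closure only captures a plain `String`, it serializes cheaply to the executors, and each file's header is removed wherever it appears in the combined RDD.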