Reading a CSV with a comment header in Spark

Posted: 2018-02-20 00:48:14

Tags: scala csv apache-spark

I have the following file, which I need to read with Spark in Scala -

#Version: 1.0
#Fields: date time location timezone
2018-02-02  07:27:42 US LA
2018-02-02  07:27:42 UK LN

I am currently trying to extract the fields with the following -

spark.read.csv(filepath)

I am new to Spark + Scala and would like to know whether there is a better way to extract the fields based on the #Fields line at the top of the file.

2 answers:

Answer 0 (score: 1)

You should read the text file with sparkContext's textFile API and then filter for the header line

val rdd = sc.textFile("filePath")

val header = rdd
  .filter(line => line.toLowerCase.contains("#fields:"))
  .map(line => line.split(" ").tail)
  .first()

And that should be it.
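The header step can be sanity-checked without a cluster; below is a minimal plain-Scala sketch of the same filter/map logic, using hypothetical in-memory lines in place of the RDD:

```scala
// Hypothetical in-memory stand-in for rdd's contents
val lines = Seq(
  "#Version: 1.0",
  "#Fields: date time location timezone",
  "2018-02-02 07:27:42 US LA"
)

// Keep only the #Fields line, split on spaces,
// and drop the leading "#Fields:" token itself
val header: Array[String] = lines
  .filter(line => line.toLowerCase.contains("#fields:"))
  .map(line => line.split(" ").tail)
  .head
// header now holds: date, time, location, timezone
```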

Now, if you want to create a dataframe, you should parse the header to form a schema, then filter the data lines to form Rows, and finally create the dataframe using SQLContext

import org.apache.spark.sql.types._
val schema = StructType(header.map(title => StructField(title, StringType, true)))

val dataRdd = rdd.filter(line => !line.contains("#")).map(line => Row.fromSeq(line.split(" ")))

val df = sqlContext.createDataFrame(dataRdd, schema)

df.show(false)

This should give you

+----------+--------+--------+--------+
|date      |time    |location|timezone|
+----------+--------+--------+--------+
|2018-02-02|07:27:42|US      |LA      |
|2018-02-02|07:27:42|UK      |LN      |
+----------+--------+--------+--------+

Note: if the file is tab delimited, then instead of

line.split(" ")

you should use

line.split("\t")
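For intuition, here is a standalone sketch (hypothetical rows, no Spark needed) showing how the delimiter choice affects the split, including the empty tokens that split(" ") produces when spaces repeat:

```scala
// Hypothetical data rows, not taken from any actual file
val spaceRow = "2018-02-02 07:27:42 US LA"
val tabRow   = "2018-02-02\t07:27:42\tUS\tLA"

val fromSpace = spaceRow.split(" ")  // 4 tokens
val fromTab   = tabRow.split("\t")   // 4 tokens

// With repeated spaces, split(" ") keeps an empty token;
// a whitespace regex collapses the run instead
val doubled   = "2018-02-02  07:27:42 US LA"
val naive     = doubled.split(" ")    // 5 tokens, one of them ""
val collapsed = doubled.split("\\s+") // 4 clean tokens
```

If the input mixes runs of spaces and tabs, split("\\s+") is the safer choice.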

Answer 1 (score: 0)

Sample input file "example.csv"

#Version: 1.0
#Fields: date time location timezone
2018-02-02 07:27:42 US LA
2018-02-02 07:27:42 UK LN

Test.scala

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession.Builder
import org.apache.spark.sql._

import scala.util.Try

object Test extends App {

  // create spark session and sql context
  val builder: Builder = SparkSession.builder.appName("testAvroSpark")
  val sparkSession: SparkSession = builder.master("local[1]").getOrCreate()
  val sc: SparkContext = sparkSession.sparkContext
  val sqlContext: SQLContext = sparkSession.sqlContext

  case class CsvRow(date: String, time: String, location: String, timezone: String)

  // path of your csv file
  val path: String = "example.csv"

  // read csv file and skip the first two lines

  val csvString: Seq[String] =
    sc.textFile(path).toLocalIterator.drop(2).toSeq

  // try to read only valid rows
  val csvRdd: RDD[(String, String, String, String)] =
    sc.parallelize(csvString).flatMap(r =>
      Try {
        val row: Array[String] = r.split(" ")
        CsvRow(row(0), row(1), row(2), row(3))
      }.toOption)
      .map(csvRow => (csvRow.date, csvRow.time, csvRow.location, csvRow.timezone))

  import sqlContext.implicits._

  // make data frame
  val df: DataFrame =
    csvRdd.toDF("date", "time", "location", "timezone")

  // display the data frame
  df.show()
}
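The Try-based row parsing above can also be exercised without a SparkSession; here is a minimal sketch with hypothetical input lines, showing how malformed rows are silently dropped by the flatMap:

```scala
import scala.util.Try

case class CsvRow(date: String, time: String, location: String, timezone: String)

// Hypothetical lines, including one malformed row
val rawLines = Seq(
  "2018-02-02 07:27:42 US LA",
  "malformed line",
  "2018-02-02 07:27:42 UK LN"
)

// Try fails when a row has too few fields; toOption turns the
// failure into None, and flatMap drops it from the result
val rows: Seq[CsvRow] = rawLines.flatMap { r =>
  Try {
    val f = r.split(" ")
    CsvRow(f(0), f(1), f(2), f(3))
  }.toOption
}
```

This pattern trades error reporting for robustness: bad rows vanish without a trace, so it only suits cases where silently skipping invalid input is acceptable.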