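How do I read a CSV file whose header is defined in a separate file?

Time: 2019-09-20 17:49:50

Tags: scala apache-spark apache-spark-sql

I am trying to read a large CSV file whose column headers are stored in a separate file, for example:

Sample CSV part_000.csv (pipe-delimited):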

000c7c09-66d7-47d6-9415-87e5010fe282|2019-04-08|EMAIL|active|43
030c2309-44d7-4676-7815-83e5010f3256|2019-03-18|EMAIL|lapsed|32

Sample header file _HEADER:

cid|character varying(36)
startdate|date
channel|character varying(20)
status|character varying(6)
age|integer

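How can I read the CSV file and use the header file to assign the schema?

1 answer:

Answer 0: (score: 2)

You can build a schema from the HEADER file and then use that schema to read the data: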

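  import org.apache.spark.sql.types._

  // Map a type name from the header file to a Spark SQL type.
  def defineType(str: String): DataType = {
    str match {
      case "date" => DateType
      case "integer" => IntegerType
      case x if x.startsWith("character") => StringType
      //  ... other types and logic
      // Fail with a clear message instead of a MatchError on unmapped types.
      case other => throw new IllegalArgumentException(s"Unsupported type: $other")
    }
  }

  // Read the header file (one "name|type" pair per line) and build a schema from it.
  // `spark` is the active SparkSession, e.g. the one provided by spark-shell.
  def createSchema(pathToSchema: String): StructType = {
    val schemaDF = spark.read.option("sep", "|").csv(pathToSchema)
    val fields: Array[StructField] = schemaDF.collect().map { row =>
      StructField(row.getString(0), defineType(row.getString(1)))
    }
    StructType(fields)
  }

  val schema = createSchema("./data/csv_data/HEADER.csv")

  // Apply the schema while reading the pipe-delimited data file.
  val df = spark.read.option("sep", "|").schema(schema).csv("./data/csv_data/part_000.csv")

  df.show(false)
  df.printSchema()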

Output:

+------------------------------------+----------+-------+------+---+
|cid                                 |startdate |channel|status|age|
+------------------------------------+----------+-------+------+---+
|000c7c09-66d7-47d6-9415-87e5010fe282|2019-04-08|EMAIL  |active|43 |
|030c2309-44d7-4676-7815-83e5010f3256|2019-03-18|EMAIL  |lapsed|32 |
+------------------------------------+----------+-------+------+---+

root
 |-- cid: string (nullable = true)
 |-- startdate: date (nullable = true)
 |-- channel: string (nullable = true)
 |-- status: string (nullable = true)
 |-- age: integer (nullable = true)
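
Because the header file is tiny, the same idea can also be sketched without reading it through Spark: turn the "name|type" lines into a DDL schema string in plain Scala. This is only a sketch under stated assumptions, not part of the original answer. The object name HeaderToDdl, the helper names, the STRING fallback for unmapped types, and the assumption that column names are plain identifiers needing no quoting are all hypothetical choices made for illustration.

```scala
object HeaderToDdl {

  // Map a type name from the header file to a Spark SQL DDL type name.
  // Assumption: anything not listed here falls back to STRING.
  def toSparkType(sqlType: String): String = {
    val t = sqlType.trim.toLowerCase
    if (t.startsWith("character")) "STRING"
    else if (t == "date") "DATE"
    else if (t == "integer") "INT"
    else "STRING"
  }

  // Turn header lines of the form "name|type" into a DDL string such as
  // "cid STRING, startdate DATE". Blank lines are skipped.
  def toDdl(headerLines: Seq[String]): String =
    headerLines
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { line =>
        // Split on the first pipe only, so the type part is kept whole.
        line.split("\\|", 2) match {
          case Array(name, sqlType) => s"${name.trim} ${toSparkType(sqlType)}"
          case _ => throw new IllegalArgumentException(s"Malformed header line: $line")
        }
      }
      .mkString(", ")
}
```

The header lines could be loaded with scala.io.Source.fromFile(path).getLines().toSeq, and the resulting string passed to spark.read.option("sep", "|").schema(ddl).csv(...), since DataFrameReader.schema accepts a DDL-formatted string from Spark 2.3 onward.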