I am trying to read a large CSV file whose column headers are stored in a separate file. For example:
Sample CSV part_000.csv (pipe-delimited):
000c7c09-66d7-47d6-9415-87e5010fe282|2019-04-08|EMAIL|active|43
030c2309-44d7-4676-7815-83e5010f3256|2019-03-18|EMAIL|lapsed|32
Sample header file _HEADER:
cid|character varying(36)
startdate|date
channel|character varying(20)
status|character varying(6)
age|integer
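How can I read the CSV file and apply a schema taken from the header file?
Answer 0 (score: 2)
You can build a schema from the HEADER file and then read the data with that schema: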
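import org.apache.spark.sql.types._

// Map a type name from the header file to a Spark SQL type.
def defineType(str: String): DataType = {
  str match {
    case "date" => DateType
    case "integer" => IntegerType
    case x if x.startsWith("character") => StringType
    // ... other types and logic
    // Fail with a clear message instead of a MatchError on unmapped types.
    case other => throw new IllegalArgumentException(s"Unsupported type: $other")
  }
}

// Read the header file (one "name|type" pair per line) and build a StructType.
def createSchema(pathToSchema: String): StructType = {
  val schemaDF = spark.read.option("sep", "|").csv(pathToSchema)
  val fields: Array[StructField] = schemaDF.collect().map { row =>
    StructField(row.getString(0), defineType(row.getString(1)))
  }
  StructType(fields)
}

// `spark` is the SparkSession (predefined in spark-shell).
val schema = createSchema("./data/csv_data/HEADER.csv")
val df = spark.read.option("sep", "|").schema(schema).csv("./data/csv_data/part_000.csv")
df.show(false)
df.printSchema()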
Output:
+------------------------------------+----------+-------+------+---+
|cid                                 |startdate |channel|status|age|
+------------------------------------+----------+-------+------+---+
|000c7c09-66d7-47d6-9415-87e5010fe282|2019-04-08|EMAIL  |active|43 |
|030c2309-44d7-4676-7815-83e5010f3256|2019-03-18|EMAIL  |lapsed|32 |
+------------------------------------+----------+-------+------+---+
root
|-- cid: string (nullable = true)
|-- startdate: date (nullable = true)
|-- channel: string (nullable = true)
|-- status: string (nullable = true)
|-- age: integer (nullable = true)
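
As a possible variation on the approach above, the header lines can be turned into a DDL-style schema string, which DataFrameReader.schema also accepts in its String overload (available in Spark 2.3 and later, as far as I know). This is only a minimal sketch: the object name HeaderToDdl and the helpers toDdlType and toDdl are hypothetical, and the type mapping is an assumption that covers just the three types in the sample header.

```scala
// Hypothetical helper, not part of Spark: converts "name|sql type" header
// lines into a DDL schema string such as "`cid` STRING, `age` INT".
object HeaderToDdl {

  // Map a type name from the header file to a Spark SQL DDL type name.
  // Assumption: only the types seen in the sample header are handled.
  def toDdlType(sqlType: String): String = {
    val t = sqlType.trim.toLowerCase
    if (t == "date") "DATE"
    else if (t == "integer") "INT"
    else if (t.startsWith("character")) "STRING"
    else throw new IllegalArgumentException(s"Unsupported column type: $sqlType")
  }

  // Build the DDL string from the header lines, skipping blank lines.
  def toDdl(headerLines: Seq[String]): String =
    headerLines
      .map(_.trim)
      .filter(_.nonEmpty)
      .map { line =>
        // Split on the first pipe only: left is the name, right is the type.
        val parts = line.split("\\|", 2)
        require(parts.length == 2, s"Malformed header line: $line")
        s"`${parts(0).trim}` ${toDdlType(parts(1))}"
      }
      .mkString(", ")
}

// Usage sketch (needs a SparkSession named `spark`, e.g. in spark-shell):
//   val lines = scala.io.Source.fromFile("./data/csv_data/HEADER.csv").getLines().toSeq
//   val df = spark.read.option("sep", "|").schema(HeaderToDdl.toDdl(lines))
//     .csv("./data/csv_data/part_000.csv")
```

Because the conversion is plain string handling, it can be checked without starting Spark, and the header file is read as an ordinary local text file rather than as a DataFrame.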