Error when adding a schema to a Spark DataFrame loaded from a file

Date: 2018-05-20 21:17:14

Tags: apache-spark apache-spark-sql data-conversion


A sample of the data in test.csv is shown further below.

val tableDF = spark.read.option("delimiter",",").csv("/Volumes/Data/ap/click/test.csv")
import org.apache.spark.sql.types.{StringType, StructField, StructType, IntegerType}

val schemaTd = StructType(List(StructField("time_id",IntegerType),StructField("week",IntegerType),StructField("month",IntegerType),StructField("calendar",StringType)))

val result = spark.createDataFrame(tableDF,schemaTd)

Every column in the file except the last one holds Int values, yet the code above still fails.

6659,951,219,2018-03-25 00:00:00
6641,949,219,2018-03-07 00:00:00
6645,949,219,2018-03-11 00:00:00
6638,948,219,2018-03-04 00:00:00
6646,950,219,2018-03-12 00:00:00
6636,948,219,2018-03-02 00:00:00
6643,949,219,2018-03-09 00:00:00

1 answer:

Answer 0 (score: 1)

In this case, you should provide the schema to the DataFrameReader:

import org.apache.spark.sql.types._

val schemaTd = StructType(List(
   StructField("time_id",IntegerType),
   StructField("week",IntegerType),
   StructField("month",IntegerType),
   StructField("calendar",StringType)))

val tableDF = spark.read.option("delimiter",",")
  .schema(schemaTd)
  .csv("/Volumes/Data/ap/click/test.csv")
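As a side note (an alternative not mentioned in the original answer): if you are willing to let Spark guess the column types instead of declaring them, the CSV reader's `inferSchema` option can be used. This is a hedged sketch; the inferred types depend on the data, and an extra pass over the file is required.

```scala
// Sketch: let Spark infer column types instead of supplying a schema.
// Columns will be auto-named _c0, _c1, ... unless a header row exists.
val inferredDF = spark.read
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .csv("/Volumes/Data/ap/click/test.csv")
```

Declaring the schema explicitly is still preferable when the types are known, since it avoids the extra pass and guarantees the expected column types.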

When a Dataset is created from an RDD[Row] (I assume your actual code is spark.createDataFrame(tableDF.rdd, schemaTd), since otherwise it shouldn't really compile), the types must be consistent with the schema. You cannot provide String values (the default type produced by the CSV reader) while declaring the schema with IntegerType.
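If the DataFrame has already been read with the default String columns, another way to reconcile the types is to cast each column explicitly. This is a minimal sketch assuming the default auto-generated column names `_c0` through `_c3`:

```scala
// Sketch: cast the string columns of an already-loaded DataFrame
// to the types declared in the target schema, renaming them as we go.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.IntegerType

val result = tableDF.select(
  col("_c0").cast(IntegerType).as("time_id"),
  col("_c1").cast(IntegerType).as("week"),
  col("_c2").cast(IntegerType).as("month"),
  col("_c3").as("calendar"))
```

Note that `cast` silently produces null for values that cannot be converted, whereas supplying the schema to the reader up front keeps the ingestion and typing in one step.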