Unable to define the schema of a CSV file in a DataFrame

Posted: 2018-04-12 08:00:10

Tags: scala dataframe

I am trying to define the schema of a CSV file using a case class like this:

final case class AadharData(date: String, registrar: String, agency: String,
  state: String, district: String, subDistrict: String, pinCode: String,
  gender: String, age: String, aadharGenerated: String, rejected: String,
  mobileNo: Double, email: String)

But when I assign this schema to the CSV file, an extra column gets added automatically:

val colNames = classOf[AadharData].getDeclaredFields.map(x => x.getName)
val df = spark.read.option("header", false)
  .csv("/home/harsh/Hunny/HadoopPractice/Spark/DF/AadharAnalysis/aadhaar_data.csv")
  .toDF(colNames: _*)
  .as[AadharData]

This is what I get for colNames:

Array(date, registrar, agency, state, district, subDistrict, pinCode, gender, age, aadharGenerated, rejected, mobileNo, email, $outer)

Note the extra $outer entry at the end, which is not a field I declared.

And this is the error I get for df:

java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match.
Old column names (13): _c0, _c1, _c2, _c3, _c4, _c5, _c6, _c7, _c8, _c9, _c10, _c11, _c12
New column names (14): date, registrar, agency, state, district, subDistrict, pinCode, gender, age, aadharGenerated, rejected, mobileNo, email, $outer
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.sql.Dataset.toDF(Dataset.scala:376)
  ... 54 elided

1 Answer:

Answer 0 (score: 1)

It looks like the schema you specified in colNames is different from the one your original DataFrame actually has. You can try the following:

  1. Print the DataFrame's schema before calling toDF(colNames:_*), using df.printSchema (see the sketch after this list).
  2. Make sure the two have the same number of columns.

Good luck!
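
A minimal sketch of that check, assuming the file path and 13-column layout from the question. The filterNot on "$" is a workaround added here, not part of the original answer: the $outer entry in the error suggests the case class was declared inside an enclosing scope (such as the spark-shell, which wraps each line in a synthetic class), so the compiler adds a synthetic $outer field that getDeclaredFields also returns. Declaring the case class at the top level of a compiled application avoids that field entirely.

import org.apache.spark.sql.SparkSession

// Top-level declaration, so getDeclaredFields sees only the 13 data fields.
final case class AadharData(date: String, registrar: String, agency: String,
  state: String, district: String, subDistrict: String, pinCode: String,
  gender: String, age: String, aadharGenerated: String, rejected: String,
  mobileNo: Double, email: String)

object AadharSchemaCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("AadharSchemaCheck")
      .master("local[*]")
      .getOrCreate()

    val raw = spark.read.option("header", false)
      .csv("/home/harsh/Hunny/HadoopPractice/Spark/DF/AadharAnalysis/aadhaar_data.csv")

    // Step 1: inspect what the reader actually produced (_c0 .. _c12, all strings).
    raw.printSchema()

    // Drop compiler-generated names such as $outer before renaming.
    val colNames = classOf[AadharData].getDeclaredFields
      .map(_.getName)
      .filterNot(_.contains("$"))

    // Step 2: fail fast if the two column counts still differ.
    require(colNames.length == raw.columns.length,
      s"case class has ${colNames.length} fields, CSV has ${raw.columns.length} columns")

    val df = raw.toDF(colNames: _*)
    df.printSchema()

    spark.stop()
  }
}

Note that even with matching names, .as[AadharData] will still fail here, because every CSV column is read as a string while mobileNo is declared as Double, and Spark only allows safe up-casts in as; either declare mobileNo as String as well, or cast the column before calling .as.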