I am running the following code to define a case class:
scala> case class AadharDetails (DateType: Int, Registrar: String,PrivateAgency: String, State: String, District: String, SubDistrict :String, PinCode: Int, Gender: String, Age: Int, AadharGenerated : Int, Rejected: Int, MobileNo: Int,email_id: Int)
defined class AadharDetails
Then I create a DataFrame using the case class:
scala> val df = spark.read.textFile("/home/anil/spark-2.0.2-bin-hadoop2.6/aadhaar_data.csv").map(_.split(",")).map(attributes=>AadharDetails (attributes(0).trim.toInt, attributes(1), attributes(2), attributes(3), attributes(4), attributes(5), attributes(6).trim.toInt, attributes(7),attributes(8).trim.toInt, attributes(9).trim.toInt, attributes(10).trim.toInt, attributes(11).trim.toInt, attributes(12).trim.toInt)).toDF()
df: org.apache.spark.sql.DataFrame = [DateType: int, Registrar: string ... 11 more fields]
scala> df.printSchema()
root
|-- DateType: integer (nullable = true)
|-- Registrar: string (nullable = true)
|-- PrivateAgency: string (nullable = true)
|-- State: string (nullable = true)
|-- District: string (nullable = true)
|-- SubDistrict: string (nullable = true)
|-- PinCode: integer (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: integer (nullable = true)
|-- AadharGenerated: integer (nullable = true)
|-- Rejected: integer (nullable = true)
|-- MobileNo: integer (nullable = true)
|-- email_id: integer (nullable = true)
scala> df.createOrReplaceTempView("data")
scala> spark.sql("select distinct DateType from data").show()
**This throws an error.** Please let me know why distinct does not work here.
Sample data:
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1
20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,65,1,0,0,0
Answer 0 (score: 0)
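This can happen when the values in the DateType column (or in any of the other Int columns) are not all compatible with Int. The data may contain nulls or strings that cannot be parsed as a valid Integer, for example a header row at the top of the CSV file. Because Spark evaluates transformations lazily, the toInt conversions inside map only run when an action such as show() is triggered, which is why printSchema() succeeds but the query fails.
Displaying the DataFrame will probably fail as well: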
scala> df.show()
Check your source data to confirm the data type mismatch.
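One way to check the source data is to test each line with scala.util.Try before building the case class. The following is only a minimal sketch: the helper name parsesCleanly, the intColumns list, and the sample lines (including the header row and the row with an empty Age) are assumptions made up for illustration, not taken from your file.

```scala
import scala.util.Try

// 0-based positions of the fields that the case class converts with toInt.
val intColumns = Seq(0, 6, 8, 9, 10, 11, 12)

// Hypothetical helper: true when the line has 13 fields and every Int field parses.
def parsesCleanly(line: String): Boolean = {
  val fields = line.split(",", -1) // -1 keeps trailing empty fields
  fields.length == 13 &&
    intColumns.forall(i => Try(fields(i).trim.toInt).isSuccess)
}

// Made-up sample: a header row, a valid row, and a row with an empty Age field.
val sample = Seq(
  "Date,Registrar,PrivateAgency,State,District,SubDistrict,PinCode,Gender,Age,Generated,Rejected,Mobile,Email",
  "20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1",
  "20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,,1,0,0,1"
)

// Lines that would make the toInt calls in the original map throw.
val badLines = sample.filterNot(parsesCleanly)
badLines.foreach(println)
```

In the spark-shell, the same predicate could be applied to the real file, for example with spark.read.textFile(path).filter(line => !parsesCleanly(line)).show(false), where path stands for your CSV location. If that prints any rows, those are the lines to fix or filter out before calling toDF().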