Question

I am running the following code to define a case class:

scala> case class AadharDetails (DateType: Int, Registrar: String,PrivateAgency: String, State: String, District: String, SubDistrict :String, PinCode: Int, Gender: String, Age: Int, AadharGenerated : Int, Rejected: Int, MobileNo: Int,email_id: Int)

defined class AadharDetails

Creating a DataFrame using the case class:

scala> val df = spark.read.textFile("/home/anil/spark-2.0.2-bin-hadoop2.6/aadhaar_data.csv").map(_.split(",")).map(attributes=>AadharDetails (attributes(0).trim.toInt, attributes(1), attributes(2), attributes(3), attributes(4), attributes(5), attributes(6).trim.toInt, attributes(7),attributes(8).trim.toInt, attributes(9).trim.toInt, attributes(10).trim.toInt, attributes(11).trim.toInt, attributes(12).trim.toInt)).toDF()

df: org.apache.spark.sql.DataFrame = [DateType: int, Registrar: string ... 11 more fields]

scala> df.printSchema()
root
|-- DateType: integer (nullable = true)
|-- Registrar: string (nullable = true)
|-- PrivateAgency: string (nullable = true)
|-- State: string (nullable = true)
|-- District: string (nullable = true)
|-- SubDistrict: string (nullable = true)
|-- PinCode: integer (nullable = true)
|-- Gender: string (nullable = true)
|-- Age: integer (nullable = true)
|-- AadharGenerated: integer (nullable = true)
|-- Rejected: integer (nullable = true)
|-- MobileNo: integer (nullable = true)
|-- email_id: integer (nullable = true)


scala> df.createOrReplaceTempView("data")


scala> spark.sql("select distinct DateType from data").show()
This throws an error. Why does distinct not work here?

Sample data: 20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1

20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,65,1,0,0,0

Answer 1

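This is probably caused by incompatible data in the DateType column. It may contain nulls or strings that cannot be parsed as a valid Integer.

Displaying the DataFrame is likely to fail as well:

scala> df.show()

Check your source data to confirm the data type mismatch.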
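
One way to check the source data is sketched below in plain Scala, with no Spark dependency, so it can be tried on a few lines before touching the full file. It assumes the failure comes from `toInt` being applied to a field that is not a valid Int, which the question does not confirm. The object name `AadharRowCheck`, the helper `badColumns`, and the header row in `main` are hypothetical and used only for illustration. The column positions match the Int fields of the `AadharDetails` case class.

```scala
import scala.util.Try

object AadharRowCheck {
  // Zero-based positions of the fields that AadharDetails declares as Int.
  val intColumns: Seq[Int] = Seq(0, 6, 8, 9, 10, 11, 12)

  // Returns the Int positions in one CSV line that would break the mapping:
  // the field is missing, or its trimmed text is not a valid Int
  // (empty, non-numeric, or outside the Int range).
  def badColumns(line: String): Seq[Int] = {
    // A limit of -1 keeps trailing empty fields instead of dropping them.
    val fields = line.split(",", -1)
    intColumns.filter { i =>
      i >= fields.length || Try(fields(i).trim.toInt).isFailure
    }
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      // First sample row from the question.
      "20150420,Allahabad Bank,A-Onerealtors Pvt Ltd,Delhi,South Delhi,Defence Colony,110025,F,49,1,0,0,1",
      // Hypothetical header row, not taken from the question.
      "Date,Registrar,Private Agency,State,District,Sub District,Pin Code,Gender,Age,Generated,Rejected,Mobile,Email"
    )
    // Report every line that has at least one unparseable Int field.
    lines.zipWithIndex.foreach { case (line, n) =>
      val bad = badColumns(line)
      if (bad.nonEmpty)
        println(s"line ${n + 1}: non-integer values at positions ${bad.mkString(", ")}")
    }
  }
}
```

In the Spark shell, the same helper could probably be used to list the offending rows before the case class mapping runs, for example with `spark.read.textFile(path).filter(line => AadharRowCheck.badColumns(line).nonEmpty).show(false)`, where `path` stands for the CSV path from the question.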
