Spark Scala: replace blank DataFrame records with "0"

Time: 2017-03-14 02:08:49

Tags: scala apache-spark apache-spark-sql

I need to replace the blank records in one field of my DataFrame with "0".

Here is my code:

import sqlContext.implicits._

case class CInspections (business_id:Int, score:String, date:String, type1:String)

val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile (s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map ( line => line.split ("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map ( raw_inspections => CInspections (raw_inspections(0).toInt,raw_inspections(1), raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
raw_inspectionsDF.show()

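I am using a case class and then converting to a DataFrame. However, I need score to be an Int, because I have to run some operations on it and sort by it. But if I declare it as score:Int, I get an error for the empty values:

java.lang.NumberFormatException: For input string: ""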

+-----------+-----+--------+--------------------+
|business_id|score|    date|               type1|
+-----------+-----+--------+--------------------+
|         10|     |20140807|Reinspection/Foll...|
|         10|   94|20140729|Routine - Unsched...|
|         10|     |20140124|Reinspection/Foll...|
|         10|   92|20140114|Routine - Unsched...|
|         10|   98|20121114|Routine - Unsched...|
|         10|     |20120920|Reinspection/Foll...|
|         17|     |20140425|Reinspection/Foll...|
+-----------+-----+--------+--------------------+

I need the score field to be an Int because, in the following query, it is sorted as a String rather than an Int and gives the wrong result:

// Query the temp view registered above ("Inspections"), not the Scala variable name.
sqlContext.sql("""select score from Inspections where score <> '' order by score""").show()

+-----+
|score|
+-----+
|  100|
|  100|
|  100|
+-----+
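
To illustrate why the string ordering above looks wrong, here is a minimal sketch in plain Scala (no Spark needed). The score values and the object name `SortOrderSketch` are made up for the example; only the sorting behaviour matters.

```scala
object SortOrderSketch {
  // Hypothetical scores, stored as strings the way the question's DataFrame holds them.
  val scores: Seq[String] = Seq("94", "100", "92", "98")

  // Lexicographic order: strings are compared character by character,
  // so "100" sorts before "92" because '1' < '9'.
  def asStrings: Seq[String] = scores.sorted

  // Numeric order: convert to Int first, then sort.
  def asInts: Seq[Int] = scores.map(_.toInt).sorted

  def main(args: Array[String]): Unit = {
    println(asStrings) // List(100, 92, 94, 98)
    println(asInts)    // List(92, 94, 98, 100)
  }
}
```

The same thing happens in the SQL query: as long as score is a string column, `order by score` compares text, not numbers.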

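1 Answer:

Answer 0: (score: 1)

An empty string cannot be converted to an integer. You need to make score nullable, so that a missing field is represented as null. You can try the following: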

import scala.util.{Try, Success, Failure}

1) Define a custom parsing function that returns None if the string cannot be converted to an Int, which in your case means the empty string;

def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(x) => None
  }
}
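
As a quick check of how such a parser behaves, the following sketch calls it on a few made-up inputs. It uses `Try(...).toOption`, which should be equivalent to the pattern match in parseScore; the wrapper object `ParseScoreSketch` and the sample strings are assumptions for illustration only.

```scala
import scala.util.Try

object ParseScoreSketch {
  // Same intent as parseScore: Some(n) for a valid integer string, None otherwise.
  def parseScore(s: String): Option[Int] = Try(s.toInt).toOption

  def main(args: Array[String]): Unit = {
    println(parseScore("94"))  // Some(94)
    println(parseScore(""))    // None
    println(parseScore("n/a")) // None
    println(parseScore(" 94")) // None: toInt does not trim whitespace
  }
}
```

Note that any unparseable value, not just the empty string, ends up as None, so malformed scores would be silently treated as missing.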

2) Define the score field in the case class as type Option[Int];

case class CInspections (business_id:Int, score: Option[Int], date:String, type1:String)

val raw_inspections = sc.textFile("test.csv")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))

3) Parse the score field with the custom parseScore function;

val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections => 
    CInspections(raw_inspections(0).toInt, parseScore(raw_inspections(1)), 
                 raw_inspections(2),raw_inspections(3)))

val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")

raw_inspectionsDF.printSchema
//root
// |-- business_id: integer (nullable = false)
// |-- score: integer (nullable = true)
// |-- date: string (nullable = true)
// |-- type1: string (nullable = true)

raw_inspectionsDF.show()

+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
|          1| null|   a|    b|
|          2|    3|   s|    k|
+-----------+-----+----+-----+

4) Once the file has been parsed correctly, you can use the fill function of na to easily replace the null values with 0:

raw_inspectionsDF.na.fill(0).show
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
|          1|    0|   a|    b|
|          2|    3|   s|    k|
+-----------+-----+----+-----+
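
Two variations may be worth knowing, sketched below under stated assumptions. First, `na.fill(0)` with no column list fills every numeric column, so passing `Seq("score")` limits the replacement to the one field. Second, if the DataFrame already exists with score as a string, casting the column to int is an alternative to re-parsing. This is not the approach from the answer above, and it assumes Spark's ANSI mode (`spark.sql.ansi.enabled`) is off, in which case an unparseable string such as "" is cast to null rather than raising an error. The sample rows, the object name `FillScoreSketch`, and the helper `castAndFill` are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FillScoreSketch {
  // Cast the string score column to int (unparseable values become null
  // when ANSI mode is off), then replace nulls in that column only with 0.
  def castAndFill(df: DataFrame): DataFrame =
    df.withColumn("score", col("score").cast("int"))
      .na.fill(0, Seq("score"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-score-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up rows mirroring the question's layout; score starts out as a String.
    val df = Seq((10, "", "20140807"), (10, "94", "20140729"), (10, "100", "20121114"))
      .toDF("business_id", "score", "date")

    val scored = castAndFill(df)
    scored.printSchema()
    // With score as an integer column, ordering is numeric.
    scored.orderBy(col("score")).show()

    spark.stop()
  }
}
```

Whether 0 is the right replacement depends on the analysis: it will pull down averages and sort ahead of every real score, so keeping the nulls and filtering them out may be preferable in some queries.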