Filling in missing values in rows using Apache Spark

Time: 2019-04-23 05:26:00

Tags: scala apache-spark dataframe apache-spark-sql

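I have a specific requirement: every value (category) of one column must be present for each code. For example, in the table below, I want a way to populate the missing "UNSEEN" and "ASSIGNED" categories for code HL_14108.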

val df = Seq(
("HL_13203","DELIVERED",3226), 
("HL_13203","UNSEEN",249),     
("HL_13203","UNDELIVERED",210),
("HL_13203","ASSIGNED",2),    
("HL_14108","DELIVERED",3083), 
("HL_14108","UNDELIVERED",164),
("HL_14108","PICKED",1)).toDF("code","status","count")

Input:

+--------+-----------+-----+
|    code|     status|count|
+--------+-----------+-----+
|HL_13203|  DELIVERED| 3226|
|HL_13203|     UNSEEN|  249|
|HL_13203|UNDELIVERED|  210|
|HL_13203|   ASSIGNED|    2|
|HL_14108|  DELIVERED| 3083|
|HL_14108|UNDELIVERED|  164|
|HL_14108|     PICKED|    1|
+--------+-----------+-----+

Expected output:

+--------+-----------+-----+
|    code|     status|count|
+--------+-----------+-----+
|HL_13203|  DELIVERED| 3226|
|HL_13203|     UNSEEN|  249|
|HL_13203|UNDELIVERED|  210|
|HL_13203|   ASSIGNED|    2|
|HL_13203|     PICKED|    0|
|HL_14108|  DELIVERED| 3083|
|HL_14108|UNDELIVERED|  164|
|HL_14108|     PICKED|    1|
|HL_14108|     UNSEEN|    0|
|HL_14108|   ASSIGNED|    0|
+--------+-----------+-----+

I want to add a row for each category that is missing for each code, with a count of 0. What is the right way to do this in Apache Spark?

1 answer:

Answer 0 (score: 2)

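First, create a new dataframe containing all possible combinations of the code and status columns. This can be done in several ways, but the most straightforward is a cross join: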

val states = df.select("status").dropDuplicates()
val codes = df.select("code").dropDuplicates()
val df2 = codes.crossJoin(states)

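A better approach is to first collect all possible statuses and then use explode together with typedLit (available from Spark 2.2 onward). This produces the same dataframe: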

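import org.apache.spark.sql.functions.{explode, typedLit}
import spark.implicits._ // needed for .as[String]; assumes the session is named spark (already in scope in spark-shell)

val states = df.select("status").dropDuplicates().as[String].collect()
val codes = df.select("code").dropDuplicates()
val df2 = codes.withColumn("status", explode(typedLit(states)))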

For older Spark versions, the same functionality as typedLit can be obtained with array(states.map(lit(_)): _*).
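The older-version variant can be sketched as follows. This is a minimal illustration rather than part of the original answer: the object name ExplodeStatesCompat, the helper allCombinations, and the local SparkSession settings are assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, explode, lit}

// Hypothetical wrapper object, used only for this illustration.
object ExplodeStatesCompat {

  // Builds every (code, status) combination without typedLit:
  // each status becomes a literal column, the literals are packed
  // into an array column, and explode emits one row per element.
  def allCombinations(df: DataFrame): DataFrame = {
    val spark = df.sparkSession
    import spark.implicits._ // provides the String encoder for .as[String]

    val states: Array[String] =
      df.select("status").dropDuplicates().as[String].collect()
    val codes = df.select("code").dropDuplicates()

    codes.withColumn("status", explode(array(states.map(lit(_)): _*)))
  }

  def main(args: Array[String]): Unit = {
    // Local session settings are assumptions for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-states-compat")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("HL_13203", "DELIVERED", 3226),
      ("HL_13203", "ASSIGNED", 2),
      ("HL_14108", "PICKED", 1)
    ).toDF("code", "status", "count")

    allCombinations(df).show()
    spark.stop()
  }
}
```

Because the distinct statuses are collected to the driver, this variant is only reasonable when the number of distinct statuses is small.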


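Then join with the original dataframe to obtain the count column. Rows for combinations that are missing from the original data will have a null count after the left join, so na.fill(0) is used to set them to 0:

df2.join(df, Seq("code", "status"), "left").na.fill(0)

Resulting dataframe: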

+--------+-----------+-----+
|    code|     status|count|
+--------+-----------+-----+
|HL_13203|UNDELIVERED|  210|
|HL_13203|   ASSIGNED|    2|
|HL_13203|     UNSEEN|  249|
|HL_13203|     PICKED|    0|
|HL_13203|  DELIVERED| 3226|
|HL_14108|UNDELIVERED|  164|
|HL_14108|   ASSIGNED|    0|
|HL_14108|     UNSEEN|    0|
|HL_14108|     PICKED|    1|
|HL_14108|  DELIVERED| 3083|
+--------+-----------+-----+
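Putting the steps together, a self-contained sketch of the cross-join approach might look like the following. The object name FillMissingStatuses, the helper fillMissing, and the local SparkSession settings are assumptions for illustration and are not part of the original answer. The row order printed by show() is not guaranteed unless the result is sorted explicitly.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical wrapper object, used only for this illustration.
object FillMissingStatuses {

  // Returns one row per (code, status) pair; pairs that do not occur
  // in the input get count = 0.
  def fillMissing(df: DataFrame): DataFrame = {
    val codes = df.select("code").dropDuplicates()
    val states = df.select("status").dropDuplicates()

    codes.crossJoin(states)                    // all combinations
      .join(df, Seq("code", "status"), "left") // missing pairs get a null count
      .na.fill(0L, Seq("count"))               // replace null counts with 0
  }

  def main(args: Array[String]): Unit = {
    // Local session settings are assumptions for a standalone run.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-missing-statuses")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("HL_13203", "DELIVERED", 3226),
      ("HL_13203", "UNSEEN", 249),
      ("HL_13203", "UNDELIVERED", 210),
      ("HL_13203", "ASSIGNED", 2),
      ("HL_14108", "DELIVERED", 3083),
      ("HL_14108", "UNDELIVERED", 164),
      ("HL_14108", "PICKED", 1)
    ).toDF("code", "status", "count")

    // Sort only to make the printed output stable.
    fillMissing(df).orderBy("code", "status").show()
    spark.stop()
  }
}
```

Restricting na.fill to the count column avoids touching any other numeric columns the dataframe may have.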
