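I have an unusual requirement: every code must have a row for every value (category) of a column. For example, in the table below, I want a way to fill in the missing "UNSEEN" and "ASSIGNED" categories for code HL_14108.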
val df = Seq(
("HL_13203","DELIVERED",3226),
("HL_13203","UNSEEN",249),
("HL_13203","UNDELIVERED",210),
("HL_13203","ASSIGNED",2),
("HL_14108","DELIVERED",3083),
("HL_14108","UNDELIVERED",164),
("HL_14108","PICKED",1)).toDF("code","status","count")
Input:
+--------+-----------+-----+
| code| status|count|
+--------+-----------+-----+
|HL_13203| DELIVERED| 3226|
|HL_13203| UNSEEN| 249|
|HL_13203|UNDELIVERED| 210|
|HL_13203| ASSIGNED| 2|
|HL_14108| DELIVERED| 3083|
|HL_14108|UNDELIVERED| 164|
|HL_14108| PICKED| 1|
+--------+-----------+-----+
Expected output:
+--------+-----------+-----+
| code| status|count|
+--------+-----------+-----+
|HL_13203| DELIVERED| 3226|
|HL_13203| UNSEEN| 249|
|HL_13203|UNDELIVERED| 210|
|HL_13203| ASSIGNED| 2|
|HL_13203| PICKED| 0|
|HL_14108| DELIVERED| 3083|
|HL_14108|UNDELIVERED| 164|
|HL_14108| PICKED| 1|
|HL_14108| UNSEEN| 0|
|HL_14108| ASSIGNED| 0|
+--------+-----------+-----+
I want to add the missing category rows for each code. What is the correct way to do this in Apache Spark?
Answer 0 (score: 2)
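First, create a new dataframe containing all possible combinations of the code and status columns. This can be done in different ways, but the most straightforward is a cross join: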
val states = df.select("status").dropDuplicates()
val codes = df.select("code").dropDuplicates()
val df2 = codes.crossJoin(states)
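A better approach, which avoids the cross join, is to first identify all possible states and then use explode and typedLit (available from Spark 2.2+). This produces the same dataframe: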
val states = df.select("status").dropDuplicates().as[String].collect()
val codes = df.select("code").dropDuplicates()
val df2 = codes.withColumn("status", explode(typedLit(states)))
For older Spark versions, array(states.map(lit(_)): _*) provides the same functionality as typedLit.
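The variant for older Spark versions can be sketched as below. The object name `ExplodeStatuses` and the helper `allCombinations` are hypothetical names introduced for illustration, and the statuses are collected via `Row.getString` rather than `.as[String]` only so that the sketch needs no implicit encoders.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, explode, lit}

object ExplodeStatuses {
  // Hypothetical helper (not part of the original answer): returns one row per
  // (code, status) pair without using typedLit.
  def allCombinations(df: DataFrame): DataFrame = {
    // Collect the distinct statuses to the driver; assumes the list is small.
    val states: Array[String] =
      df.select("status").distinct().collect().map(_.getString(0))
    val codes = df.select("code").dropDuplicates()
    // Build an array column from one literal per status, then explode it
    // so that each code gets one row per status.
    codes.withColumn("status", explode(array(states.map(lit(_)): _*)))
  }
}
```

Calling `ExplodeStatuses.allCombinations(df)` should give the same set of rows as the `typedLit` version, though row order is not guaranteed.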
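Then, join with the old dataframe to obtain the count column. Rows that have no matching count come out of the left join as null, so na.fill(0) is used to set them to 0:
df2.join(df, Seq("code", "status"), "left").na.fill(0)
Resulting dataframe: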
+--------+-----------+-----+
| code| status|count|
+--------+-----------+-----+
|HL_13203|UNDELIVERED| 210|
|HL_13203| ASSIGNED| 2|
|HL_13203| UNSEEN| 249|
|HL_13203| PICKED| 0|
|HL_13203| DELIVERED| 3226|
|HL_14108|UNDELIVERED| 164|
|HL_14108| ASSIGNED| 0|
|HL_14108| UNSEEN| 0|
|HL_14108| PICKED| 1|
|HL_14108| DELIVERED| 3083|
+--------+-----------+-----+
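For reference, the whole approach can be put together as a small standalone program. This is a minimal sketch, assuming a local SparkSession; the object name `FillMissingStatuses`, the helper `fillMissing`, and the app name are hypothetical and not from the original answer. It restricts the fill to the count column and sorts the output only to make the printed order stable.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object FillMissingStatuses {
  // Hypothetical helper: pairs every code with every status seen anywhere
  // in `df`, then sets the count of previously absent pairs to 0.
  def fillMissing(df: DataFrame): DataFrame = {
    val states = df.select("status").dropDuplicates()
    val codes = df.select("code").dropDuplicates()
    codes.crossJoin(states)
      .join(df, Seq("code", "status"), "left") // absent pairs get a null count
      .na.fill(0, Seq("count"))                // replace those nulls with 0
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for experimentation only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("fill-missing-statuses")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("HL_13203", "DELIVERED", 3226),
      ("HL_13203", "UNSEEN", 249),
      ("HL_13203", "UNDELIVERED", 210),
      ("HL_13203", "ASSIGNED", 2),
      ("HL_14108", "DELIVERED", 3083),
      ("HL_14108", "UNDELIVERED", 164),
      ("HL_14108", "PICKED", 1)).toDF("code", "status", "count")

    // Sort so that the printed order is deterministic.
    fillMissing(df).orderBy("code", "status").show()
    spark.stop()
  }
}
```

Note that the set of statuses is taken from the data itself, so a status that never occurs for any code will not be added.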