+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|ID_NOTIFICATION|ID_ENTITE|ID_ENTITE_GARANTE|CD_ETAT|DT_ETAT            |CD_ANOMALIE|CD_TYPE_DESTINATAIRE|CD_TYPE_EVENEMENT   |CD_SYS_APPELANT|TYP_MVT|DT_DEBUT           |DT_FIN             |
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|3110305        |GNE      |GNE              |AT     |2019-06-12 00:03:14|null       |null                |REL_CP_ULTIME_PAPIER|SIGMA          |C      |2019-06-12 00:03:22|2019-06-12 00:03:32|
|3110305        |GNE      |GNE              |AN     |2019-06-12 00:03:28|017        |IDGRC               |REL_CP_ULTIME_PAPIER|SIGMA          |M      |2019-06-12 00:03:22|2019-06-12 15:08:43|
|3110305        |GNE      |GNE              |AN     |2019-06-12 00:03:28|017        |IDGRC               |REL_CP_ULTIME_PAPIER|SIGMA          |M      |2019-06-12 00:03:22|2019-06-12 15:10:06|
|3110305        |GNE      |GNE              |AN     |2019-06-12 15:10:02|017        |IDGRC               |REL_CP_ULTIME_PAPIER|SIGMA          |M      |2019-06-12 00:03:22|2019-06-12 15:10:51|
|3110305        |GNE      |GNE              |AN     |2019-06-12 15:10:02|017        |IDGRC               |REL_CP_ULTIME_PAPIER|SIGMA          |M      |2019-06-12 00:03:22|2019-06-12 15:11:35|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
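Is there a way to keep only one row for each distinct value of the CD_ETAT column? In this case, that would be the first two rows.
Something similar to this SQL solution, but in Scala using DataFrame functions, please. Thanks.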
Answer 0 (score: 2)
You can use a window function with partitionBy on CD_ETAT, then orderBy on DT_ETAT and keep the first row of each partition:
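import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// One partition per CD_ETAT value, earliest DT_ETAT first
val window = Window.partitionBy("CD_ETAT").orderBy("DT_ETAT")

// Number the rows inside each partition and keep only the first one
df.withColumn("row_num", row_number().over(window))
  .filter(col("row_num") === 1) // col(...) works without importing spark.implicits._
  .drop("row_num")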
Output:
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|ID_NOTIFICATION|ID_ENTITE|ID_ENTITE_GARANTE|CD_ETAT|            DT_ETAT|CD_ANOMALIE|CD_TYPE_DESTINATAIRE|   CD_TYPE_EVENEMENT|CD_SYS_APPELANT|TYP_MVT|           DT_DEBUT|             DT_FIN|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
|        3110305|      GNE|              GNE|     AT|2019-06-12 00:03:14|       null|                null|REL_CP_ULTIME_PAPIER|          SIGMA|      C|2019-06-12 00:03:22|2019-06-12 00:03:32|
|        3110305|      GNE|              GNE|     AN|2019-06-12 00:03:28|        017|               IDGRC|REL_CP_ULTIME_PAPIER|          SIGMA|      M|2019-06-12 00:03:22|2019-06-12 15:08:43|
+---------------+---------+-----------------+-------+-------------------+-----------+--------------------+--------------------+---------------+-------+-------------------+-------------------+
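One caveat with ordering by DT_ETAT alone: two of the AN rows share the same DT_ETAT (2019-06-12 00:03:28), so row_number() may pick either of them and the DT_FIN in the result is not guaranteed. The following is a minimal, self-contained sketch of the same window technique with a tie-breaker. It rests on several assumptions that are not in the original answer: a local SparkSession, a reduced set of columns, DT_FIN as the tie-breaking column, and ID_NOTIFICATION added to the partition in case the real data holds more than one notification. The object and method names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object FirstRowPerState {
  // Reduced copy of the question's rows. Timestamps are kept as strings here;
  // the yyyy-MM-dd HH:mm:ss format sorts chronologically as plain text.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (3110305L, "AT", "2019-06-12 00:03:14", "2019-06-12 00:03:32"),
      (3110305L, "AN", "2019-06-12 00:03:28", "2019-06-12 15:08:43"),
      (3110305L, "AN", "2019-06-12 00:03:28", "2019-06-12 15:10:06"),
      (3110305L, "AN", "2019-06-12 15:10:02", "2019-06-12 15:10:51"),
      (3110305L, "AN", "2019-06-12 15:10:02", "2019-06-12 15:11:35")
    ).toDF("ID_NOTIFICATION", "CD_ETAT", "DT_ETAT", "DT_FIN")
  }

  // Keeps the earliest row per (ID_NOTIFICATION, CD_ETAT).
  // Assumption: DT_FIN is an acceptable tie-breaker when DT_ETAT is equal.
  def firstPerState(df: DataFrame): DataFrame = {
    val w = Window
      .partitionBy("ID_NOTIFICATION", "CD_ETAT")
      .orderBy(col("DT_ETAT"), col("DT_FIN"))
    df.withColumn("row_num", row_number().over(w))
      .filter(col("row_num") === 1)
      .drop("row_num")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("first-row-per-state")
      .getOrCreate()
    firstPerState(sample(spark)).show(false)
    spark.stop()
  }
}
```

If the earliest DT_FIN is not the row you want for tied states, change the second orderBy column; the point is only that the ordering should be total within each partition.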
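Answer 1 (score: 0)
If you want the distinct rows of a DataFrame, you can use .distinct() directly.
.distinct() returns the distinct rows of the DataFrame, but in your case two columns (DT_ETAT, DT_FIN) still hold different values from row to row, so it will not reduce the DataFrame to just two rows.
For your case, a simple solution may be to select the columns other than (DT_ETAT, DT_FIN) and then apply .distinct().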
// ".." stands for the remaining columns, excluding DT_ETAT and DT_FIN
val new_df = df.select("ID_NOTIFICATION", "ID_ENTITE", "ID_ENTITE_GARANTE", "CD_ETAT", ..).distinct()
// Take a look at the results
new_df.show()
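An equivalent way to express this without listing every remaining column is to drop the two varying columns and then call .distinct(). The sketch below is only an illustration under stated assumptions: a local SparkSession, a reduced subset of the question's columns, and hypothetical object and method names. Note that the result no longer contains DT_ETAT or DT_FIN, so it differs from the output the question asks for.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DistinctWithoutTimestamps {
  // Reduced copy of the question's rows; CD_ANOMALIE is nullable, hence Option.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val rows: Seq[(Long, String, Option[String], String, String, String)] = Seq(
      (3110305L, "AT", None,        "C", "2019-06-12 00:03:14", "2019-06-12 00:03:32"),
      (3110305L, "AN", Some("017"), "M", "2019-06-12 00:03:28", "2019-06-12 15:08:43"),
      (3110305L, "AN", Some("017"), "M", "2019-06-12 00:03:28", "2019-06-12 15:10:06"),
      (3110305L, "AN", Some("017"), "M", "2019-06-12 15:10:02", "2019-06-12 15:10:51"),
      (3110305L, "AN", Some("017"), "M", "2019-06-12 15:10:02", "2019-06-12 15:11:35")
    )
    rows.toDF("ID_NOTIFICATION", "CD_ETAT", "CD_ANOMALIE", "TYP_MVT", "DT_ETAT", "DT_FIN")
  }

  // Removes the columns that vary within a state, then de-duplicates what is left.
  def distinctStates(df: DataFrame): DataFrame =
    df.drop("DT_ETAT", "DT_FIN").distinct()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]") // assumption: local run, for illustration only
      .appName("distinct-without-timestamps")
      .getOrCreate()
    distinctStates(sample(spark)).show(false)
    spark.stop()
  }
}
```

If the timestamp columns must be kept, the window approach from the other answer is the better fit, because .distinct() on the reduced column set cannot return them.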