我有一个像这样的数据框:
| ID_VISITE_CALCULE| TAG_TS_TO_TS|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| RK|
+--------------------+-------------------+------------------+------------------------+---+
|GA1.2.1023040287....|2019-04-23 11:24:19| dupont| null| 1|
|GA1.2.1023040287....|2019-04-23 11:24:19| durand| null| 2|
|GA1.2.105243141.1...|2019-04-23 11:21:01| null| null| 1|
|GA1.2.1061963529....|2019-04-23 11:12:19| null| null| 1|
|GA1.2.1065635192....|2019-04-23 11:07:14| antoni| null| 1|
|GA1.2.1074357108....|2019-04-23 11:11:34| lang| null| 1|
|GA1.2.1074357108....|2019-04-23 11:12:37| lang| null| 2|
|GA1.2.1075803022....|2019-04-23 11:28:38| cavail| null| 1|
|GA1.2.1080137035....|2019-04-23 11:20:00| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| null| null| 1|
|GA1.2.1081805479....|2019-04-23 11:10:49| linare| null| 2|
|GA1.2.1111218536....|2019-04-23 11:28:43| null| null| 1|
|GA1.2.1111218536....|2019-04-23 11:32:26| null| null| 2|
|GA1.2.1111570355....|2019-04-23 11:07:00| null| null| 1|
+--------------------+-------------------+------------------+------------------------+---+
我正在尝试应用规则以按ID_VISITE_CALCULE进行汇总,并且仅保留一行ID。
对于ID(一组),我希望:
获取该组的第一个时间戳并将其存储在START列中
获取组的最后一个时间戳并将其存储在END列中
测试整个组的EXTERNAL_PERSON_ID是否相同。 如果是这种情况,并且它为NULL,那么我写NULL;如果是这种情况,并且它是一个名称,那么我写该名称。最后,如果组中的值不同,那么我将注册UNDEFINED
RESULT :
+--------------------+------------------+------------------------+-------------------+-------------------+
| ID_VISITE_CALCULE|EXTERNAL_PERSON_ID|EXTERNAL_ORGANISATION_ID| START| END|
+--------------------+------------------+------------------------+-------------------+-------------------+
|GA1.2.1023040287....| undefined| null|2019-04-23 11:24:19|2019-04-23 11:24:19|
|GA1.2.105243141.1...| null| null|2019-04-23 11:21:01|2019-04-23 11:21:01|
|GA1.2.1061963529....| null| null|2019-04-23 11:12:19|2019-04-23 11:12:19|
|GA1.2.1065635192....| antoni| null|2019-04-23 11:07:14|2019-04-23 11:07:14|
|GA1.2.1074357108....| lang| null|2019-04-23 11:11:34|2019-04-23 11:12:37|
|GA1.2.1075803022....| cavail| null|2019-04-23 11:28:38|2019-04-23 11:28:38|
|GA1.2.1080137035....| null| null|2019-04-23 11:20:00|2019-04-23 11:20:00|
|GA1.2.1081805479....| undefined| null|2019-04-23 11:10:49|2019-04-23 11:10:49|
|GA1.2.1111218536....| null| null|2019-04-23 11:28:43|2019-04-23 11:32:26|
|GA1.2.1111570355....| null| null|2019-04-23 11:07:00|2019-04-23 11:07:00|
+--------------------+------------------+------------------------+-------------------+-------------------+
在我的示例中,一个组最多只能有2行,但是在实际数据集中,一个组中可以有数百行。
谢谢您的帮助。
答案 0 :(得分:0)
/如果您从构建数据框的位置显示一些代码/示例数据,那就太好了。
假设您的数据框为tableDf
** Spark Sql解决方案**
tableDf.createOrReplaceTempView("input_table")
val sqlStr ="""
select ID_VISITE_CALCULE,
(case when count(distinct person_id_calculation) > 1 then "undefined"
when count(distinct person_id_calculation) = 1 and
max(person_id_calculation) = "noNull" then ""
else max(person_id_calculation)) as EXTERNAL_PERSON_ID,
-- do the same for EXTERNAL_ORGANISATION_ID
max(start_v) as start_v, max(last_v) as last_v
from
(select ID_VISITE_CALCULE,
( case
when nvl(EXTERNAL_PERSON_ID,"noNull") =
lag(EXTERNAL_PERSON_ID,1,"noNull")over(partition by
ID_VISITE_CALCULE order by TAG_TS_TO_TS) then
EXTERNAL_PERSON_ID
else "undefined" end ) AS person_id_calculation,
-- Same calculation for EXTERNAL_ORGANISATION_ID
first(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
TAG_TS_TO_TS) as START_V,
last(TAG_TS_TO_TS) over(partition by ID_VISITE_CALCULE order by
TAG_TS_TO_TS) as last_V
from input_table ) a
group by 1
"""
val resultDf = spark.sql(sqlStr)
答案 1 :(得分:0)
所有操作都可以在单个groupby调用中完成,但是我建议(略)提高性能,并考虑将代码分为2个调用的可读性:
import org.apache.spark.sql.functions.{col, size, collect_set, max, min, when, lit}
val res1DF = df.groupBy(col("ID_VISITE_CALCULE")).agg(
min(col("START")).alias("START"),
max(col("END")).alias("END"),
collect_set(col("EXTERNAL_PERSON_ID")).alias("EXTERNAL_PERSON_ID"),
collect_set(col("EXTERNAL_ORGANIZATION_ID")).alias("EXTERNAL_ORGANIZATION_ID")
)
val res2DF = res1DF.withColumn("EXTERNAL_PERSON_ID",
when(
size(col("EXTERNAL_PERSON_ID")) > 1,
lit("UNDEFINED")).otherwise(col("EXTERNAL_PERSON_ID").getItem(0)
)
).withColumn("EXTERNAL_ORGANIZATION_ID",
when(
size(col("EXTERNAL_ORGANIZATION_ID")) > 1,
lit("UNDEFINED")).otherwise(col("EXTERNAL_ORGANIZATION_ID").getItem(0)
)
)
方法getItem
在后台执行大多数条件。如果值集为空,则将返回null
,如果只有1个单个值,则将返回该值。