I am working with Spark DataFrames (the server runs Spark 2.3, and there is nothing I can do about that). I am trying to recover the original identifier of objects whose IDs can change over time (recursively). I think this is best explained with an example:
val df = spark.sql("values ('AA','AB'), ('BA','BA'), ('AB','AC'), ('CA','CA'), ('BA','BB'), ('AC','AD')")
  .withColumnRenamed("col1", "ID")
  .withColumnRenamed("col2", "NEW_ID")
+------+---------+
| ID   | NEW_ID  |
+------+---------+
| AA   | AB      |  (AA changes to AB)
| BA   | BA      |  (no change in ID for now)
| AB   | AC      |  (AB, whose "father" was AA, changes to AC)
| CA   | CA      |  (no change)
| BA   | BB      |  (BA changes to BB)
| AC   | AD      |  (AC, whose "grandfather" was AA, changes to AD)
+------+---------+
Desired output:
+------+---------+--------------+
| ID   | NEW_ID  | ORIGINAL_ID  |
+------+---------+--------------+
| AA   | AB      | AA           |
| BA   | BA      | BA           |
| AB   | AC      | AA           |
| CA   | CA      | CA           |
| BA   | BB      | BA           |
| AC   | AD      | AA           |
+------+---------+--------------+
So the output tells me, for every identifier ever used, which identifier it originally descends from. Note: the rows are ordered so that a row that changes an ID always appears before any row that uses the new ID (in the real DataFrame there is a row_id column that can be used for ordering).
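(For reference, the toy df above has no such column; a rough sketch of generating one follows. The real DataFrame would simply use its existing row_id instead, and note that monotonically_increasing_id only guarantees increasing values, not any particular global order.)

import org.apache.spark.sql.functions._

// Hypothetical stand-in for the real row_id column:
val dfOrdered = df
  .withColumn("row_id", monotonically_increasing_id())
  .orderBy("row_id")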
Although I was able (using array and explode) to combine the previous and the next identifier into a single column, the problem with this solution is that it only lets me group across two levels of ID changes. Since there is no limit on the number of changes an ID can go through, it does not solve my problem:
df.withColumn("agg",explode(array($"ID",$"NEW_ID")))
+---+------+---+
| ID|NEW_ID|agg|
+---+------+---+
| AA| AB| AA|
| AA| AB| AB|
| BA| BA| BA|
| BA| BA| BA|
| AB| AC| AB|
| AB| AC| AC|
| CA| CA| CA|
| CA| CA| CA|
| BA| BB| BA|
| BA| BB| BB|
| AC| AD| AC|
| AC| AD| AD|
+---+------+---+
This approach handles only two levels; its limitation shows up below, where "AC" cannot be traced back to its "grandfather" ID:
df.withColumn("agg",explode(array($"ID",$"NEW_ID"))).groupBy("agg").agg(collect_set("ID").as("orig")).show
+---+--------+
|agg| orig|
+---+--------+
| AA| [AA]|
| AD| [AC]|
| CA| [CA]|
| BA| [BA]|
| AB|[AB, AA]|
| AC|[AB, AC]|
| BB| [BA]|
+---+--------+
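To make the depth limitation concrete: a single self-join can trace each ID back exactly one more level, so AC reaches AB but still not AA; every further level of change would need yet another join. A minimal sketch of that one extra step (the ANCESTOR_2 column name is my own, not part of the data):

import org.apache.spark.sql.functions._

// For each row, look up the row whose NEW_ID equals our ID and take its ID
// as a one-level-older ancestor (falling back to our own ID when none exists).
val oneLevelBack = df.as("cur")
  .join(df.as("prev"), $"cur.ID" === $"prev.NEW_ID", "left")
  .select(
    $"cur.ID",
    $"cur.NEW_ID",
    coalesce($"prev.ID", $"cur.ID").as("ANCESTOR_2")
  )
// For (AC, AD) this yields ANCESTOR_2 = AB -- still not AA, which would
// require a second join, and so on for arbitrarily deep chains.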