I have a PySpark dataframe in which two columns are arrays with a one-to-one correspondence (the first element of the first array maps to the first element of the second array, and so on).
I then create all possible subsets of both columns with a udf function, and I want to explode so that every subset gets its own row. Unfortunately I cannot explode two columns at the same time, so I have to use a workaround.
My attempt so far is to explode the two columns separately, give unique ids to the exploded rows, join the new dataframes on that id, and hope that every subset matches its counterpart from the other dataframe.
Let me elaborate with some code:
from pyspark.sql.functions import udf, collect_list, collect_set
from itertools import combinations, chain
from pyspark.sql.functions import explode
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import ArrayType, StringType
#Create the dataframe
df = spark.createDataFrame( [(1,'a','aa'),(1, 'b','bb'), (2, 'c','cc'),
(2, 'd','dd'),(3, 'e','ee'),(4, 'f','ff'),(5, 'v','vv'),
(5, 'b','bb'),(6, 't','tt'),(6, 't','tt')
] , ["id", "colA","colB"])
df.show()
>>>
+---+----+----+
| id|colA|colB|
+---+----+----+
| 1| a| aa|
| 1| b| bb|
| 2| c| cc|
| 2| d| dd|
| 3| e| ee|
| 4| f| ff|
| 5| v| vv|
| 5| b| bb|
| 6| t| tt|
| 6| t| tt|
+---+----+----+
#Group by and collect
df = df.groupBy(df.id).agg(collect_list("colA").alias("colAList"),collect_list("colB").alias("colBList"))
df.show()
>>>
+---+--------+--------+
| id|colAList|colBList|
+---+--------+--------+
| 6| [t, t]|[tt, tt]|
| 5| [v, b]|[vv, bb]|
| 1| [a, b]|[aa, bb]|
| 3| [e]| [ee]|
| 2| [c, d]|[cc, dd]|
| 4| [f]| [ff]|
+---+--------+--------+
#Create all possible subsets for colA, colB with a udf
allsubsets = lambda l: [[z for z in y] for y in chain(*[combinations(l , n) for n in range(1,len(l)+1)])]
#Create all possible subsets for each column separately
df = df.withColumn('colAsubsets',udf(allsubsets,ArrayType(ArrayType(StringType())))(df['colAList']))
df = df.withColumn('colBsubsets',udf(allsubsets,ArrayType(ArrayType(StringType())))(df['colBList']))
df.show()
>>>
+---+--------+--------+------------------------------------------------------+----------------------------------------------------------+
|id |colAList|colBList|colAsubsets |colBsubsets |
+---+--------+--------+------------------------------------------------------+----------------------------------------------------------+
|6 |[t, t] |[tt, tt]|[WrappedArray(t), WrappedArray(t), WrappedArray(t, t)]|[WrappedArray(tt), WrappedArray(tt), WrappedArray(tt, tt)]|
|5 |[v, b] |[vv, bb]|[WrappedArray(v), WrappedArray(b), WrappedArray(v, b)]|[WrappedArray(vv), WrappedArray(bb), WrappedArray(vv, bb)]|
|1 |[a, b] |[aa, bb]|[WrappedArray(a), WrappedArray(b), WrappedArray(a, b)]|[WrappedArray(aa), WrappedArray(bb), WrappedArray(aa, bb)]|
|3 |[e] |[ee] |[WrappedArray(e)] |[WrappedArray(ee)] |
|2 |[c, d] |[cc, dd]|[WrappedArray(c), WrappedArray(d), WrappedArray(c, d)]|[WrappedArray(cc), WrappedArray(dd), WrappedArray(cc, dd)]|
|4 |[f] |[ff] |[WrappedArray(f)] |[WrappedArray(ff)] |
+---+--------+--------+------------------------------------------------------+----------------------------------------------------------+
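To make the udf's behaviour concrete, here is the same lambda run as plain Python on one row's list (no Spark involved):
#Quick local check of what allsubsets produces for one row's list
from itertools import combinations, chain
allsubsets = lambda l: [[z for z in y] for y in chain(*[combinations(l , n) for n in range(1,len(l)+1)])]
print(allsubsets(['a', 'b']))
>>>
[['a'], ['b'], ['a', 'b']]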
Using monotonically_increasing_id, I explode each column separately and then try to join the results:
dfEx = df.select(monotonically_increasing_id().alias('mid'),'id','colAList','colBList', explode('colAsubsets'))
dfEx = dfEx.withColumnRenamed('col','col_dfEx')
dfEx2 = df.select(monotonically_increasing_id().alias('mid'),'id','colAList','colBList', explode('colBsubsets'))
dfEx2 = dfEx2.withColumnRenamed('col','col_dfEx2')
#Join and hope we have a match
dj = dfEx.join(dfEx2, dfEx.mid == dfEx2.mid, 'inner').drop(dfEx2.mid)
dj.select([dfEx.col_dfEx, dfEx2.col_dfEx2]).show()
>>>
+--------+---------+
|col_dfEx|col_dfEx2|
+--------+---------+
|     [t]|     [tt]|
|     [t]|     [tt]|
|  [t, t]| [tt, tt]|
|     [v]|     [vv]|
|     [b]|     [bb]|
|  [v, b]| [vv, bb]|
|     [a]|     [aa]|
|     [b]|     [bb]|
|  [a, b]| [aa, bb]|
|     [e]|     [ee]|
|     [c]|     [cc]|
|     [d]|     [dd]|
|  [c, d]| [cc, dd]|
|     [f]|     [ff]|
+--------+---------+
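One way to at least detect part of the problem on the big dataframe is to count joined rows whose two subsets differ in length. This is only a partial check, since two equal-length subsets can still be paired wrongly:
from pyspark.sql.functions import size
#Partial sanity check: any joined row whose subsets differ in length proves a mismatch
dj.filter(size(dj.col_dfEx) != size(dj.col_dfEx2)).count()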
This is the desired result.
However, when I try this code on a dataframe with millions of records, some of the records get the wrong mapping.
So I would like to ask: is there something wrong in my code? Is monotonically_increasing_id guaranteed to give the two dataframes the same ids after the explode, in all cases?
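For reference, a minimal sketch of the kind of workaround that would avoid the id matching entirely: zip the two lists into (colA, colB) pairs first, take subsets of the pairs with a single udf, and explode once, so the correspondence never leaves the row. The names paired_subsets, a and b are hypothetical, and I have not tested this at scale:
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import ArrayType, StructType, StructField, StringType
from itertools import combinations
#Build subsets over zipped (colA, colB) pairs so a single explode keeps the pairing
pair_schema = ArrayType(ArrayType(StructType([
    StructField("a", StringType()),
    StructField("b", StringType())])))
def paired_subsets(la, lb):
    pairs = list(zip(la, lb))
    return [list(c) for n in range(1, len(pairs) + 1) for c in combinations(pairs, n)]
dfp = df.select('id', explode(udf(paired_subsets, pair_schema)('colAList', 'colBList')).alias('pairs'))
dfp = dfp.select('id', dfp.pairs.a.alias('colAsubset'), dfp.pairs.b.alias('colBsubset'))
This would keep each colA subset glued to its colB subset by construction instead of relying on explode order and generated ids.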
Note: this is the next step of this question.