PySpark: explode two array columns while keeping the mapping between them

Time: 2018-08-08 10:47:31

Tags: apache-spark dataframe pyspark

I have a PySpark DataFrame in which two columns are arrays with a one-to-one correspondence between them (the first element of the first array maps to the first element of the second array, and so on).

I then create all possible subsets of both columns with a UDF, and I want to explode so that each subset gets its own row. Unfortunately, I cannot explode the two columns at the same time, so I have to use a workaround.

My attempt so far is to explode the two columns separately, assign a unique id in each new DataFrame, join on that id, and hope that each subset lines up with its corresponding subset in the other DataFrame.

Let me illustrate with some code:

from pyspark.sql.functions import udf, collect_list, collect_set
from itertools import combinations, chain
from pyspark.sql.functions import explode
from pyspark.sql.functions import monotonically_increasing_id
from pyspark.sql.types import ArrayType, StringType

#Create the dataframe
df = spark.createDataFrame(
    [(1, 'a', 'aa'), (1, 'b', 'bb'), (2, 'c', 'cc'),
     (2, 'd', 'dd'), (3, 'e', 'ee'), (4, 'f', 'ff'),
     (5, 'v', 'vv'), (5, 'b', 'bb'), (6, 't', 'tt'), (6, 't', 'tt')],
    ["id", "colA", "colB"])

df.show()
>>>
+---+----+----+
| id|colA|colB|
+---+----+----+
|  1|   a|  aa|
|  1|   b|  bb|
|  2|   c|  cc|
|  2|   d|  dd|
|  3|   e|  ee|
|  4|   f|  ff|
|  5|   v|  vv|
|  5|   b|  bb|
|  6|   t|  tt|
|  6|   t|  tt|
+---+----+----+

#Group by id and collect colA / colB into lists
df = df.groupBy(df.id).agg(collect_list("colA").alias("colAList"),
                           collect_list("colB").alias("colBList"))

df.show()
>>>
+---+--------+--------+
| id|colAList|colBList|
+---+--------+--------+
|  6|  [t, t]|[tt, tt]|
|  5|  [v, b]|[vv, bb]|
|  1|  [a, b]|[aa, bb]|
|  3|     [e]|    [ee]|
|  2|  [c, d]|[cc, dd]|
|  4|     [f]|    [ff]|
+---+--------+--------+


#Helper: all non-empty subsets of a list, generated in a deterministic order
allsubsets = lambda l: [[z for z in y] for y in chain(*[combinations(l , n) for n in range(1,len(l)+1)])]

#Apply the subset UDF to each column separately
df = df.withColumn('colAsubsets',udf(allsubsets,ArrayType(ArrayType(StringType())))(df['colAList']))
df = df.withColumn('colBsubsets',udf(allsubsets,ArrayType(ArrayType(StringType())))(df['colBList']))

df.show()
>>>
+---+--------+--------+------------------------------------------------------+----------------------------------------------------------+
|id |colAList|colBList|colAsubsets                                           |colBsubsets                                               |
+---+--------+--------+------------------------------------------------------+----------------------------------------------------------+
|6  |[t, t]  |[tt, tt]|[WrappedArray(t), WrappedArray(t), WrappedArray(t, t)]|[WrappedArray(tt), WrappedArray(tt), WrappedArray(tt, tt)]|
|5  |[v, b]  |[vv, bb]|[WrappedArray(v), WrappedArray(b), WrappedArray(v, b)]|[WrappedArray(vv), WrappedArray(bb), WrappedArray(vv, bb)]|
|1  |[a, b]  |[aa, bb]|[WrappedArray(a), WrappedArray(b), WrappedArray(a, b)]|[WrappedArray(aa), WrappedArray(bb), WrappedArray(aa, bb)]|
|3  |[e]     |[ee]    |[WrappedArray(e)]                                     |[WrappedArray(ee)]                                        |
|2  |[c, d]  |[cc, dd]|[WrappedArray(c), WrappedArray(d), WrappedArray(c, d)]|[WrappedArray(cc), WrappedArray(dd), WrappedArray(cc, dd)]|
|4  |[f]     |[ff]    |[WrappedArray(f)]                                     |[WrappedArray(ff)]                                        |
+---+--------+--------+------------------------------------------------------+----------------------------------------------------------+
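
To illustrate why a workaround is needed at all: exploding the two subset columns one after the other pairs every colA subset with every colB subset of the same id, i.e. a cross product instead of the one-to-one pairing. A minimal sketch on the df built above (naive, aSub and bSub are illustrative names, not part of the original code):

#Naive double explode: per id, this gives the cross product of the two subset lists
naive = df.select('id', explode('colAsubsets').alias('aSub'), 'colBsubsets')
naive = naive.select('id', 'aSub', explode('colBsubsets').alias('bSub'))

#For id = 1 this produces 3 x 3 = 9 rows instead of the 3 matching pairs
naive.filter(naive.id == 1).show()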

Using monotonically_increasing_id, I explode each column separately and then try to join the results:

dfEx = df.select(monotonically_increasing_id().alias('mid'),'id','colAList','colBList', explode('colAsubsets'))
dfEx = dfEx.withColumnRenamed('col','col_dfEx')

dfEx2 = df.select(monotonically_increasing_id().alias('mid'),'id','colAList','colBList', explode('colBsubsets'))
dfEx2 = dfEx2.withColumnRenamed('col','col_dfEx2')

#Join and hope we have a match
dj = dfEx.join(dfEx2, dfEx.mid == dfEx2.mid, 'inner').drop(dfEx2.mid)
dj.select([dfEx.col_dfEx, dfEx2.col_dfEx2]).show()
>>>
+--------+---------+
|col_dfEx|col_dfEx2|
+--------+---------+
|     [t]|     [tt]|
|     [t]|     [tt]|
|  [t, t]| [tt, tt]|
|     [v]|     [vv]|
|     [b]|     [bb]|
|  [v, b]| [vv, bb]|
|     [a]|     [aa]|
|     [b]|     [bb]|
|  [a, b]| [aa, bb]|
|     [e]|     [ee]|
|     [c]|     [cc]|
|     [d]|     [dd]|
|  [c, d]| [cc, dd]|
|     [f]|     [ff]|
+--------+---------+

This is the desired result.

However, when I run this code on a DataFrame with millions of records, the mapping is wrong for some of them.

So I would like to ask: is there something wrong in my code? After the explode, is monotonically_increasing_id guaranteed to give the two DataFrames matching ids in all cases?
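
For reference, monotonically_increasing_id only guarantees ids that are unique and increasing within a single DataFrame; the values depend on how the data is partitioned, so two separately computed DataFrames are not guaranteed to produce matching ids, which would explain mismatches showing up only at scale. One way to avoid the join entirely is to build the (colA subset, colB subset) pairs inside a single UDF, so the two subsets travel together and only one explode is needed. A minimal sketch along those lines (paired_subsets, aSub, bSub and dfPairs are illustrative names, reusing the allsubsets helper from above):

from pyspark.sql.types import StructType, StructField

#Pair the i-th subset of colAList with the i-th subset of colBList inside one UDF,
#so the correspondence is preserved by construction
def paired_subsets(a, b):
    return [(x, y) for x, y in zip(allsubsets(a), allsubsets(b))]

pairSchema = ArrayType(StructType([
    StructField('aSub', ArrayType(StringType())),
    StructField('bSub', ArrayType(StringType()))
]))

dfPairs = df.withColumn('pairs', udf(paired_subsets, pairSchema)('colAList', 'colBList'))

#A single explode keeps each pair intact; the struct is then split back into two columns
dfPairs = dfPairs.select('id', explode('pairs').alias('pair'))
dfPairs = dfPairs.select('id', 'pair.aSub', 'pair.bSub')
dfPairs.show()

For id 1 this yields the three rows ([a], [aa]), ([b], [bb]) and ([a, b], [aa, bb]) without relying on row ids.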

Note: this is the next step after this question.

0 Answers:

No answers yet