PySpark - map a column of type list to a column of type list<list>, based on criteria from a different column

Asked: 2018-02-26 04:33:12

Tags: python apache-spark pyspark mapping

I'm having trouble describing exactly what I want, so here's an example.

Starting from this:

name   sport                                      position
Bob    ['basketball', 'basketball', 'football']   ['PG', 'SG', 'QB']
Jon    ['hockey', 'football', 'football']         ['LW', 'WR', 'TE']
Tim    ['baseball', 'basketball']                 ['1B', 'PG']

To this:

name   sport                       position
Bob    ['basketball', 'football']  [['PG', 'SG'], 'QB']
Jon    ['hockey', 'football']      ['LW', ['WR', 'TE']]
Tim    ['baseball', 'basketball']  ['1B', 'PG']

I thought about doing an 'explode' on 'position' and then on 'sport', followed by a 'groupBy' and an 'agg', but that produces many unwanted rows that I'd then need to filter further.

Is there a mapping technique I can use to generate the new 'position' column?

(To get the new 'sport' column, I just need to remove the duplicate items.)
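For the deduplication part alone, a minimal plain-Python sketch (outside Spark, on one row's list) could use an order-preserving trick such as `dict.fromkeys`; the variable names here are illustrative:

```python
# Order-preserving removal of duplicates from one row's 'sport' list.
sports = ['basketball', 'basketball', 'football']
unique_sports = list(dict.fromkeys(sports))  # dict keys keep insertion order
print(unique_sports)  # → ['basketball', 'football']
```

This handles non-adjacent duplicates as well, unlike approaches that only merge consecutive runs.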

1 answer:

Answer 0: (score: 0)

You can achieve your requirement by writing a udf function:

from itertools import groupby
from pyspark.sql import functions as F
from pyspark.sql import types as T

def function(x, y):
    # Group consecutive identical sports and collect the matching positions.
    keys = []
    values = []
    for key, group in groupby(zip(x, y), lambda pair: pair[0]):
        keys.append(key)
        values.append([position for _, position in group])
    return keys, values

# The function returns a flat array of sports plus a nested array of
# positions, so the udf's return type must be a struct of the two --
# a plain ArrayType(StringType()) would not match the nested lists.
schema = T.StructType([
    T.StructField("sport", T.ArrayType(T.StringType())),
    T.StructField("position", T.ArrayType(T.ArrayType(T.StringType()))),
])
udfFunctionCall = F.udf(function, schema)

Call the udf function to create the new column:

df = df.withColumn("new", udfFunctionCall(df['sport'], df['position']))

Then select the necessary columns:

df.select(df.name, df.new.sport.alias("sport"), df.new.position.alias("position")).show(truncate=False)

This should give you the desired output dataframe:

+----+----------------------+----------------+
|name|sport                 |position        |
+----+----------------------+----------------+
|Bob |[basketball, football]|[[PG, SG], [QB]]|
|Jon |[hockey, football]    |[[LW], [WR, TE]]|
|Tim |[baseball, basketball]|[[1B], [PG]]    |
+----+----------------------+----------------+
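One caveat with the groupby-based udf above: `itertools.groupby` only merges *consecutive* equal keys, so a sport that reappears non-adjacently in a row would produce two separate groups. The grouping logic can be checked in plain Python, outside Spark:

```python
from itertools import groupby

def function(x, y):
    # Same grouping logic as the udf: merge consecutive equal sports
    # and collect the positions belonging to each run.
    keys = []
    values = []
    for key, group in groupby(zip(x, y), lambda pair: pair[0]):
        keys.append(key)
        values.append([position for _, position in group])
    return keys, values

# Adjacent duplicates merge as expected:
print(function(['basketball', 'basketball', 'football'], ['PG', 'SG', 'QB']))
# → (['basketball', 'football'], [['PG', 'SG'], ['QB']])

# ...but non-adjacent duplicates stay separate:
print(function(['football', 'hockey', 'football'], ['QB', 'LW', 'WR']))
# → (['football', 'hockey', 'football'], [['QB'], ['LW'], ['WR']])
```

If rows may contain non-adjacent duplicates, sort the zipped pairs by sport before grouping.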