I'm having trouble describing exactly what I want, so here is an example.
Starting from this:
name sport position
Bob ['basketball', 'basketball', 'football'] ['PG', 'SG', 'QB']
Jon ['hockey', 'football', 'football'] ['LW', 'WR', 'TE']
Tim ['baseball', 'basketball'] ['1B', 'PG']
I want to get to this:
name sport position
Bob ['basketball', 'football'] [['PG', 'SG'], 'QB']
Jon ['hockey', 'football'] ['LW', ['WR', 'TE']]
Tim ['baseball', 'basketball'] ['1B', 'PG']
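I thought about doing an `explode` on 'position' and then on 'sport', followed by a `groupBy` and an `agg`, but that produces many unwanted rows that I would then have to filter out.
Is there a mapping technique that could be used to generate the 'position' column?
(To get the new 'sport' column, I only need to drop the duplicate entries.)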
Answer 0 (score: 0)
You can do this by writing a udf function:
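from itertools import groupby
from pyspark.sql import functions as F, types as T

def function(x, y):
    # Group consecutive (sport, position) pairs that share the same sport.
    keys = []
    values = []
    for key, group in groupby(zip(x, y), lambda pair: pair[0]):
        keys.append(key)
        values.append([z[1] for z in group])
    # Element 0 holds the de-duplicated sports, element 1 the grouped positions.
    return [keys, values]

udfFunctionCall = F.udf(function, T.ArrayType(T.StringType()))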
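To check the grouping logic without a Spark session, the same function can be run on the sample rows as plain Python lists. This is only a minimal sketch: `group_positions` is a renamed copy of the function above, and the `rows` dict is transcribed by hand from the question's example.

```python
from itertools import groupby

def group_positions(sports, positions):
    """Collapse runs of identical adjacent sports and collect their positions."""
    keys, values = [], []
    for key, group in groupby(zip(sports, positions), lambda pair: pair[0]):
        keys.append(key)
        values.append([position for _, position in group])
    return [keys, values]

# Sample data copied from the question (an assumption about the real column contents).
rows = {
    "Bob": (["basketball", "basketball", "football"], ["PG", "SG", "QB"]),
    "Jon": (["hockey", "football", "football"], ["LW", "WR", "TE"]),
    "Tim": (["baseball", "basketball"], ["1B", "PG"]),
}

results = {name: group_positions(s, p) for name, (s, p) in rows.items()}
for name, (sports, positions) in results.items():
    print(name, sports, positions)
```

Every position ends up inside a list, including those that are the only one for their sport, which is slightly different from the mixed nesting shown in the question.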
Call the udf function to create the `new` column:
df = df.withColumn("new", udfFunctionCall(df['sport'], df['position']))
Then `select` the necessary columns:
df.select(df.name, df.new[0].alias("sport"), df.new[1].alias("position")).show(truncate=False)
You should get the desired output dataframe:
+----+----------------------+----------------+
|name|sport |position |
+----+----------------------+----------------+
|Bob |[basketball, football]|[[PG, SG], [QB]]|
|Jon |[hockey, football] |[[LW], [WR, TE]]|
|Tim |[baseball, basketball]|[[1B], [PG]] |
+----+----------------------+----------------+
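One caveat: `itertools.groupby` only merges equal keys that sit next to each other, so a row like ['football', 'hockey', 'football'] would keep 'football' twice. The question does not say whether that can happen. If it can, and if non-adjacent repeats should be merged too (an assumption), a dict-based variant is one possible sketch. The name `group_positions_any_order` is hypothetical, and the function is shown as plain Python so it can be tried without Spark.

```python
def group_positions_any_order(sports, positions):
    """Group positions by sport, keeping the order in which each sport first appears."""
    grouped = {}  # dicts preserve insertion order in Python 3.7+
    for sport, position in zip(sports, positions):
        grouped.setdefault(sport, []).append(position)
    return [list(grouped.keys()), list(grouped.values())]

example = group_positions_any_order(
    ["football", "hockey", "football"],
    ["WR", "LW", "TE"],
)
print(example)  # [['football', 'hockey'], [['WR', 'TE'], ['LW']]]
```

It could be passed to `F.udf` in place of `function` if that behavior is what you need.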