I have a dataframe in which columns c through e have up to 15 variants:

cola, colb, colc_1, cold_1, cole_1, colc_2, cold_2, cole_2, ...
1, 2, 3, 4, 5, 6, 7, 8

Desired dataframe:

cola, colb, new_col, colc, cold, cole
1, 2, _1, 3, 4, 5
1, 2, _2, 6, 7, 8

I want to transpose colc through cole into rows and use the suffix of those columns (_1, _2 ... _15) as the value of the transposed field (new_col). I can do this in Pandas using melt and pivot, but the dataframe in this case is too large to convert to a Pandas df, so it needs to be done in PySpark or AWS Glue.
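For reference, a minimal sketch of the kind of Pandas melt/pivot reshaping described above (pd.wide_to_long is used here as a stand-in for the melt and pivot steps; the tiny frame and names are purely illustrative):

import pandas as pd

# hypothetical small frame in the same wide layout as the question
pdf = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6, 7, 8]],
    columns=['cola', 'colb',
             'colc_1', 'cold_1', 'cole_1',
             'colc_2', 'cold_2', 'cole_2'])

# gather colc_*/cold_*/cole_* into long form; the numeric suffix
# (1, 2, ... 15, without the leading underscore) becomes 'new_col'
long_pdf = (pd.wide_to_long(pdf,
                            stubnames=['colc', 'cold', 'cole'],
                            i=['cola', 'colb'],
                            j='new_col',
                            sep='_')
              .reset_index())
print(long_pdf)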
Answer (score: 1)
You can try select() and union(). The code below lays out the basic logic first, and then uses the reduce() function to do away with all of the intermediate dataframes:
from pyspark.sql import functions as F
from functools import reduce
df = spark.createDataFrame([
(1,2,3,4,5,6,7,8)
, (11,12,13,14,15,16,17,18)
, (21,22,23,24,25,26,27,28)
],
[ 'cola', 'colb'
, 'colc_1', 'cold_1', 'cole_1'
, 'colc_2', 'cold_2', 'cole_2'
])
# create df1 with all columns for new_col = '_1'
df1 = df.select('cola', 'colb', F.lit('_1'), 'colc_1', 'cold_1', 'cole_1')
df1.show()
#+----+----+---+------+------+------+
#|cola|colb| _1|colc_1|cold_1|cole_1|
#+----+----+---+------+------+------+
#| 1| 2| _1| 3| 4| 5|
#| 11| 12| _1| 13| 14| 15|
#| 21| 22| _1| 23| 24| 25|
#+----+----+---+------+------+------+
# do the same for '_2'
df2 = df.select('cola', 'colb', F.lit('_2'), *["col{}_2".format(i) for i in list("cde")])
df2.show()
#+----+----+---+------+------+------+
#|cola|colb| _2|colc_2|cold_2|cole_2|
#+----+----+---+------+------+------+
#| 1| 2| _2| 6| 7| 8|
#| 11| 12| _2| 16| 17| 18|
#| 21| 22| _2| 26| 27| 28|
#+----+----+---+------+------+------+
# then union these two dataframes and adjust the final column names
df_new = df1.union(df2).toDF('cola', 'colb', 'new_col', 'colc', 'cold', 'cole')
df_new.show()
#+----+----+-------+----+----+----+
#|cola|colb|new_col|colc|cold|cole|
#+----+----+-------+----+----+----+
#| 1| 2| _1| 3| 4| 5|
#| 11| 12| _1| 13| 14| 15|
#| 21| 22| _1| 23| 24| 25|
#| 1| 2| _2| 6| 7| 8|
#| 11| 12| _2| 16| 17| 18|
#| 21| 22| _2| 26| 27| 28|
#+----+----+-------+----+----+----+
Next, we can use the reduce() function to handle all the column groups without the intermediate df1, df2, etc. above:
# setup the list of columns to be normalized
normalize_cols = ["col{}".format(c) for c in list("cde")]
# ["colc", "cold", "cole"]
# change N to 16 to cover new_col from '_1' to '_15'
N = 3
# use reduce to handle all groups
df_new = reduce(
lambda d1,d2: d1.union(d2)
, [ df.select('cola', 'colb', F.lit('_{}'.format(i)), *["{}_{}".format(c,i) for c in normalize_cols]) for i in range(1,N) ]
).toDF('cola', 'colb', 'new_col', *normalize_cols)
Another way is to use F.array() and F.explode() (using reduce() or similar to cover all _N groups):
# d1/d2: one array per suffix group -> ['_i', colc_i, cold_i, cole_i]
# h: array of the group arrays; explode h to one row per group,
# then split the exploded array back into individual columns
df.withColumn('d1', F.array(F.lit('_1'), *['col{}_1'.format(c) for c in list("cde")])) \
    .withColumn('d2', F.array(F.lit('_2'), *['col{}_2'.format(c) for c in list("cde")])) \
    .withColumn('h', F.array('d1', 'd2')) \
    .withColumn('h1', F.explode('h')) \
    .select('cola', 'colb', *[ F.col('h1')[i] for i in range(4)]) \
    .toDF('cola', 'colb', 'new_col', 'colc', 'cold', 'cole') \
    .show()
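To cover all _N groups without hard-coding d1, d2, ..., the same array/explode idea can be generalized; the sketch below uses a list comprehension in place of reduce() and reuses the df, N and normalize_cols defined earlier (column names are the same illustrative ones):

# one array per suffix group: ['_i', colc_i, cold_i, cole_i]
groups = [
    F.array(F.lit('_{}'.format(i)), *['{}_{}'.format(c, i) for c in normalize_cols])
    for i in range(1, N)
]

# wrap the group arrays in an outer array, explode to one row per group,
# then split the exploded array back into individual columns
df.withColumn('h1', F.explode(F.array(*groups))) \
    .select('cola', 'colb', *[F.col('h1')[i] for i in range(len(normalize_cols) + 1)]) \
    .toDF('cola', 'colb', 'new_col', *normalize_cols) \
    .show()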
Update per the comments:

To denormalize the dataframe (back to the wide layout), I use F.array() and then F.collect_list to group the columns into a list of arrays, and then reference the values from the groupby() result.

A Window function is used to set the order of the elements in collect_list: reference link
from pyspark.sql import Window

N = 3
normalize_cols = ["col{}".format(c) for c in list("cde")]

# win spec so that elements in collect_list are sorted based on 'new_col'
win = Window.partitionBy('cola', 'colb').orderBy('new_col')

df_new.withColumn('cols', F.array(normalize_cols)) \
    .withColumn('clist', F.collect_list('cols').over(win)) \
    .groupby('cola', 'colb').agg(F.last('clist').alias('clist1')) \
    .select('cola', 'colb', *[ F.col('clist1')[i].alias('c{}'.format(i)) for i in range(N-1)]) \
    .select('cola', 'colb', *[ F.col('c{}'.format(i))[j].alias('{}_{}'.format(normalize_cols[j],i+1)) for i in range(N-1) for j in range(len(normalize_cols)) ]) \
    .show()
# +----+----+------+------+------+------+------+------+
# |cola|colb|colc_1|cold_1|cole_1|colc_2|cold_2|cole_2|
# +----+----+------+------+------+------+------+------+
# | 11| 12| 13| 14| 15| 16| 17| 18|
# | 21| 22| 23| 24| 25| 26| 27| 28|
# | 1| 2| 3| 4| 5| 6| 7| 8|
# +----+----+------+------+------+------+------+------+
Some notes:

- F.last() returns the complete collect_list from the Window function under the same partitionBy (the groupby keys).
- The first select() converts the collect_list() into c0, c1.
- The second select() converts c0 into colc_1, cold_1, cole_1 and c1 into colc_2, cold_2, cole_2.