Suppose I have a DataFrame like this:
+-----------+-----------+-----------+-----------+------------+
| ColA      | ColB      | ColC      | ColD      | ColE       |
+-----------+-----------+-----------+-----------+------------+
| ''        | sample_1x | sample_1y | ''        | sample_1z  |
| sample2_x | sample2_y | ''        | ''        | ''         |
| sample3_x | ''        | ''        | ''        | sample3_y  |
| sample4_x | sample4_y | ''        | sample4_z | sample4_zz |
| sample5_x | ''        | ''        | ''        | ''         |
+-----------+-----------+-----------+-----------+------------+
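I want to create another DataFrame that shows the left-to-right relationships within each row, skipping columns whose value is empty. Rows with only one non-empty column should be excluded as well. For example: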
+-----------+------------+-----------+
| From      | To         | Label     |
+-----------+------------+-----------+
| sample_1x | sample_1y  | ColB_ColC |
| sample_1y | sample_1z  | ColC_ColE |
| sample2_x | sample2_y  | ColA_ColB |
| sample3_x | sample3_y  | ColA_ColE |
| sample4_x | sample4_y  | ColA_ColB |
| sample4_y | sample4_z  | ColB_ColD |
| sample4_z | sample4_zz | ColD_ColE |
+-----------+------------+-----------+
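My thinking is to write a UDF containing this logic, but I'm not sure how to return an entirely new DataFrame, since I'm used to a UDF only adding another column to the same DataFrame. Alternatively, is there another Spark function that handles this more easily than a UDF? If it matters, please use PySpark.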
Answer 0 (score: 2)
You can use a UDF that takes an array argument and returns an array of structs, for example:
from pyspark.sql import functions as F
df.show()
+---------+---------+---------+---------+----------+
| ColA| ColB| ColC| ColD| ColE|
+---------+---------+---------+---------+----------+
| null|sample_1x|sample_1y| null| sample_1z|
|sample2_x|sample2_y| null| null| null|
|sample3_x| null| null| null| sample3_y|
|sample4_x|sample4_y| null|sample4_z|sample4_zz|
|sample5_x| null| null| null| null|
+---------+---------+---------+---------+----------+
# columns involved; they are grouped into a single array with F.array(cols)
cols = df.columns
# define a function that converts the array into an array of structs
def find_route(arr, cols):
    # keep (column name, value) pairs for the non-null entries, left to right
    d = [(cols[i], e) for i, e in enumerate(arr) if e is not None]
    # link each remaining entry to the next one
    return [{'From': d[i][1], 'To': d[i+1][1], 'Label': d[i][0] + '_' + d[i+1][0]} for i in range(len(d) - 1)]
# set up the UDF and pass cols as an extra argument
udf_find_route = F.udf(lambda a: find_route(a, cols), 'array<struct<From:string,To:string,Label:string>>')
# retrieve the data from the array of structs after exploding the array
df.select(F.explode(udf_find_route(F.array(cols))).alias('c1')).select('c1.*').show()
+---------+----------+---------+
| From| To| Label|
+---------+----------+---------+
|sample_1x| sample_1y|ColB_ColC|
|sample_1y| sample_1z|ColC_ColE|
|sample2_x| sample2_y|ColA_ColB|
|sample3_x| sample3_y|ColA_ColE|
|sample4_x| sample4_y|ColA_ColB|
|sample4_y| sample4_z|ColB_ColD|
|sample4_z|sample4_zz|ColD_ColE|
+---------+----------+---------+
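The row-level logic inside find_route can be checked without a Spark session. The following is a minimal sketch in plain Python: the `rows` list is a hand-written stand-in for the sample DataFrame (with None for the empty cells), and `edges` is a name introduced here only for illustration.

```python
# Column names from the example; None stands in for the empty cells.
cols = ["ColA", "ColB", "ColC", "ColD", "ColE"]

def find_route(arr, cols):
    # keep (column name, value) pairs for the non-empty cells, left to right
    d = [(cols[i], e) for i, e in enumerate(arr) if e is not None]
    # link each remaining cell to the next one; rows with fewer than
    # two non-empty cells produce an empty list
    return [
        {"From": d[i][1], "To": d[i + 1][1], "Label": d[i][0] + "_" + d[i + 1][0]}
        for i in range(len(d) - 1)
    ]

# Hand-written copy of the sample rows (an assumption, not read from Spark).
rows = [
    [None, "sample_1x", "sample_1y", None, "sample_1z"],
    ["sample2_x", "sample2_y", None, None, None],
    ["sample3_x", None, None, None, "sample3_y"],
    ["sample4_x", "sample4_y", None, "sample4_z", "sample4_zz"],
    ["sample5_x", None, None, None, None],
]

# Flatten the per-row results, which is what F.explode does on the Spark side.
edges = [edge for row in rows for edge in find_route(row, cols)]
for edge in edges:
    print(edge["From"], edge["To"], edge["Label"])
```

If the empty cells are empty strings, as in the question, rather than nulls, the filter `e is not None` would need to become something like `e not in (None, '')`.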
Answer 1 (score: 0)
Mostly using Spark SQL:
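df.createOrReplaceTempView("df")
cols_df = df.columns
# build one SELECT per pair of adjacent columns and union them together;
# only neighbouring columns are paired, so links across empty columns
# (e.g. ColA -> ColE) are not produced by this query
# backticks keep From and To from being parsed as SQL keywords
qry = " union ".join([f"""
    select {col} as `From`,
           {cols_df[i + 1]} as `To`,
           '{col}_{cols_df[i + 1]}' as Label
    from df
    where {col} <> '' and {cols_df[i + 1]} <> ''"""
    for i, col in enumerate(cols_df) if i < len(cols_df) - 1])
final_df = spark.sql(qry)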
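To also link columns that are separated by empty ones, as the question asks, one possible extension is to generate a SELECT for every ordered pair of columns and require all columns in between to be empty. This is only a sketch, not part of the original answer: `build_route_query` is a hypothetical helper, and it assumes empty cells are stored as '' (as in the question) rather than null.

```python
def build_route_query(cols, table="df"):
    """Build a UNION ALL query linking each non-empty column to the next non-empty one."""
    selects = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            # both endpoints must be non-empty ...
            conds = [f"{cols[i]} <> ''", f"{cols[j]} <> ''"]
            # ... and every column between them must be empty
            conds += [f"{c} = ''" for c in cols[i + 1:j]]
            selects.append(
                f"select {cols[i]} as `From`, {cols[j]} as `To`, "
                f"'{cols[i]}_{cols[j]}' as Label "
                f"from {table} where " + " and ".join(conds)
            )
    return "\nunion all\n".join(selects)

qry = build_route_query(["ColA", "ColB", "ColC", "ColD", "ColE"])
print(qry)
# With a SparkSession named `spark` and the temp view "df" registered:
# final_df = spark.sql(qry)
```

The number of SELECT statements grows quadratically with the number of columns (10 for the five columns here), so this is practical only for narrow tables. The result rows come back in no guaranteed order, and if the empty cells are nulls the conditions would need `is null` / `is not null` instead of comparisons with ''.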