Converting PySpark DataFrame row values into relations with the other elements in the same row

Date: 2018-05-05 09:22:12

Tags: python apache-spark pyspark pyspark-sql

I am trying to turn each row value of a Spark DataFrame into a relation with every other value in the same row. I plan to do this by keeping an array of all the elements in the row and mapping it against each individual row value. The example below illustrates this better.

Input DataFrame

>>> df = spark.createDataFrame([('1111','1010', 'aaaa'), ('2222','2020', 'bbbb'), ('3333','3030', 'cccc')], ['company_id', 'client_id', 'partner_id'])
>>> df.show()
+----------+---------+----------+
|company_id|client_id|partner_id|
+----------+---------+----------+
|      1111|     1010|      aaaa|
|      2222|     2020|      bbbb|
|      3333|     3030|      cccc|
+----------+---------+----------+

Expected output

+------+------------------+
|entity|         relations|
+------+------------------+
|  1111|[1111, 1010, aaaa]|
|  2222|[2222, 2020, bbbb]|
|  3333|[3333, 3030, cccc]|
|  1010|[1111, 1010, aaaa]|
|  2020|[2222, 2020, bbbb]|
|  3030|[3333, 3030, cccc]|
|  aaaa|[1111, 1010, aaaa]|
|  bbbb|[2222, 2020, bbbb]|
|  cccc|[3333, 3030, cccc]|
+------+------------------+

I have implemented the code below, which produces the expected output. However, the data in the real DataFrame is expected to be very large, so I just want to check whether there is a better way to approach this problem.

My implementation

from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

def unionAll(*dfs):
    return reduce(DataFrame.unionAll, dfs)

df = spark.createDataFrame([('1111','1010', 'aaaa'), ('2222','2020', 'bbbb'), ('3333','3030', 'cccc')], ['company_id', 'client_id', 'partner_id'])
company_df = df.select(df.company_id.alias('entity'), F.array(df.company_id, df.client_id, df.partner_id).alias('relations'))
client_df = df.select(df.client_id.alias('entity'), F.array(df.company_id, df.client_id, df.partner_id).alias('relations'))
partner_df = df.select(df.partner_id.alias('entity'), F.array(df.company_id, df.client_id, df.partner_id).alias('relations'))
entity_df = unionAll(company_df, client_df, partner_df)
entity_df.show()
+------+------------------+
|entity|         relations|
+------+------------------+
|  1111|[1111, 1010, aaaa]|
|  2222|[2222, 2020, bbbb]|
|  3333|[3333, 3030, cccc]|
|  1010|[1111, 1010, aaaa]|
|  2020|[2222, 2020, bbbb]|
|  3030|[3333, 3030, cccc]|
|  aaaa|[1111, 1010, aaaa]|
|  bbbb|[2222, 2020, bbbb]|
|  cccc|[3333, 3030, cccc]|
+------+------------------+

2 Answers:

Answer 0 (score: 1)

Please give this a try. You just need to create another column holding that array, and that's it. You can then drop whatever columns you don't want. It is computationally cheaper than your code:

import pyspark.sql.functions as F

df = spark.createDataFrame([('1111','1010', 'aaaa'), ('2222','2020', 'bbbb'), ('3333','3030', 'cccc')], ['company_id', 'client_id', 'partner_id'])

# withColumn expects a Column, so build the array with F.array rather than a plain Python list
df = df.withColumn('relation', F.array(df['company_id'], df['client_id'], df['partner_id']))
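
To go from that array column to the entity/relations shape shown in the question, a minimal follow-up sketch (assuming the df built just above) would explode the same array, one output row per element:

from pyspark.sql import functions as F

# Sketch: each element of the 'relation' array becomes its own 'entity' row,
# while the full array is kept alongside it as 'relations'
entity_df = df.select(
    F.explode(F.col('relation')).alias('entity'),
    F.col('relation').alias('relations')
)
entity_df.show()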

Answer 1 (score: 0)

I made an improvement over my previous implementation, shown below, in case someone finds it useful.

from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F
import pyspark.sql.types as T

# wrap filter in list() so the UDF returns a concrete list (filter is lazy under Python 3)
to_list = F.udf(lambda *x: list(filter(None, x)), T.ArrayType(T.StringType()))
df = spark.createDataFrame([('1111','1010', 'aaaa'), ('2222','2020', 'bbbb'), ('3333','3030', 'cccc')], ['company_id', 'client_id', 'partner_id'])
df = df.withColumn('relations', to_list(df['company_id'], df['client_id'], df['partner_id']))
transformed_df = df.select(F.explode(to_list(df.company_id, df.client_id, df.partner_id)).alias('entity'), df.relations)

transformed_df.show()

+------+------------------+
|entity|         relations|
+------+------------------+
|  1111|[1111, 1010, aaaa]|
|  1010|[1111, 1010, aaaa]|
|  aaaa|[1111, 1010, aaaa]|
|  2222|[2222, 2020, bbbb]|
|  2020|[2222, 2020, bbbb]|
|  bbbb|[2222, 2020, bbbb]|
|  3333|[3333, 3030, cccc]|
|  3030|[3333, 3030, cccc]|
|  cccc|[3333, 3030, cccc]|
+------+------------------+
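
For what it's worth, here is a minimal sketch (my own assumption, not part of either answer) of a UDF-free variant: building the array with F.array and exploding it keeps the whole transformation in built-in expressions, which generally avoids the Python UDF serialization overhead on large data.

import pyspark.sql.functions as F

df = spark.createDataFrame([('1111','1010', 'aaaa'), ('2222','2020', 'bbbb'), ('3333','3030', 'cccc')], ['company_id', 'client_id', 'partner_id'])

# Build the relations array once with built-in functions, then explode it into one row per entity
relations = F.array(df['company_id'], df['client_id'], df['partner_id'])
transformed_df = df.select(
    F.explode(relations).alias('entity'),
    relations.alias('relations')
)
transformed_df.show()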