I have a huge PySpark dataframe and I have to perform a set of transformations on it, but I am running into serious performance problems. I need to optimize the code, and I have been reading that reduceByKey is more efficient.
Here is an example of the dataframe:
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Output:
+------+------+---------+-------------+
|Person|Amount| Budget| Date|
+------+------+---------+-------------+
| Bob| 562| Food| 12 May 2018|
| Bob| 880| Food| 01 June 2018|
| Bob| 380|Household| 16 June 2018|
| Sue| 85|Household| 16 July 2018|
| Sue| 963|Household| 16 Sept 2018|
+------+------+---------+-------------+
I have implemented the following code but, as mentioned, the real dataframe is very large.
df_grouped = df.groupby('person').agg(F.collect_list(F.struct("Amount", "Budget", "Date")).alias("data"))
Output:
+------+--------------------------------------------------------------------------------+
|person|data |
+------+--------------------------------------------------------------------------------+
|Sue |[[85,Household, 16 July 2018], [963,Household, 16 Sept 2018]] |
|Bob |[[562,Food,12 May 2018], [880,Food,01 June 2018], [380,Household, 16 June 2018]]|
+------+--------------------------------------------------------------------------------+
The schema is:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Amount: long (nullable = true)
| | |-- Budget: string (nullable = true)
| | |-- Date: string (nullable = true)
I need to convert the groupBy into a reduceByKey so that I can produce the same schema as above.
Answer 0 (score: 0)
How about this:
def flatten(l, ltypes=(tuple,)):
    # Splice nested tuples into a single flat sequence;
    # inner lists are deliberately left intact.
    ltype = type(l)
    l = list(l)
    i = 0
    while i < len(l):
        while isinstance(l[i], ltypes):
            if not l[i]:
                l.pop(i)
                i -= 1
                break
            else:
                l[i:i + 1] = l[i]
        i += 1
    return ltype(l)
def nested_change(item, func):
    # Apply func to every leaf of an arbitrarily nested list.
    if isinstance(item, list):
        return [nested_change(x, func) for x in item]
    return func(item)

def convert(*args):
    # Pack the accumulated value and the next row into a tuple.
    return args
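To see why convert works as the reduce function, here is a minimal plain-Python sketch (no Spark needed; functools.reduce stands in for reduceByKey) showing how repeated application builds a nested tuple of the per-row lists:

```python
from functools import reduce

def convert(*args):
    # Same helper as above: pack the accumulated value and the
    # next row into a tuple, nesting deeper on each reduction step.
    return args

rows = [[562, 'Food', '12 May 2018'],
        [880, 'Food', '01 June 2018'],
        [380, 'Household', '16 June 2018']]

# reduceByKey applies the function pairwise, like functools.reduce:
merged = reduce(convert, rows)
print(merged)
# (([562, 'Food', '12 May 2018'], [880, 'Food', '01 June 2018']), [380, 'Household', '16 June 2018'])
```

Note that convert is not associative, so the exact nesting shape depends on how Spark combines values across partitions; that is why the result has to be run through flatten afterwards.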
cols = df.columns  # 'cols' was not defined in the original answer; it needs to hold the column names

df_final = df.rdd.map(lambda x: (x['Person'], [x[y] for y in cols if y != 'Person'])) \
    .reduceByKey(convert) \
    .map(lambda x: (x[0], nested_change(list(flatten(x[1])), str))) \
    .toDF(['person', 'data'])
df_final.show()
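The map/reduceByKey/map chain can be checked without a cluster. The sketch below is a pure-Python stand-in (the helpers are repeated so it runs on its own) that reproduces what the pipeline does to Bob's three rows:

```python
# Pure-Python stand-in for the map/reduceByKey/map chain above (no Spark needed).
from functools import reduce

def flatten(l, ltypes=(tuple,)):
    # Splice nested tuples into a flat sequence; inner lists stay intact.
    ltype = type(l)
    l = list(l)
    i = 0
    while i < len(l):
        while isinstance(l[i], ltypes):
            if not l[i]:
                l.pop(i)
                i -= 1
                break
            else:
                l[i:i + 1] = l[i]
        i += 1
    return ltype(l)

def nested_change(item, func):
    if isinstance(item, list):
        return [nested_change(x, func) for x in item]
    return func(item)

def convert(*args):
    return args

bob_rows = [[562, 'Food', '12 May 2018'],
            [880, 'Food', '01 June 2018'],
            [380, 'Household', '16 June 2018']]

merged = reduce(convert, bob_rows)                 # what reduceByKey produces
data = nested_change(list(flatten(merged)), str)   # what the second map produces
print(data)
# [['562', 'Food', '12 May 2018'], ['880', 'Food', '01 June 2018'], ['380', 'Household', '16 June 2018']]
```

Two caveats: nested_change converts every leaf to a string, so Amount ends up as a string rather than a long as in the original schema; and a person with only a single row never passes through convert, so their value stays a flat [Amount, Budget, Date] list instead of a list of lists.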