I have a huge PySpark dataframe and I have to perform a set of transformations on it, but I am running into serious performance problems. I need to optimize the code, and I have been reading that reduceByKey is more efficient.
Here is an example of the dataframe:
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Output:
+------+------+---------+-------------+
|Person|Amount| Budget| Date|
+------+------+---------+-------------+
| Bob| 562| Food| 12 May 2018|
| Bob| 880| Food| 01 June 2018|
| Bob| 380|Household| 16 June 2018|
| Sue| 85|Household| 16 July 2018|
| Sue| 963|Household| 16 Sept 2018|
+------+------+---------+-------------+
I have implemented the following code but, as mentioned, the real dataframe is very large.
df_grouped = df.groupby('person').agg(F.collect_list(F.struct("Amount", "Budget", "Date")).alias("data"))
Output:
+------+--------------------------------------------------------------------------------+
|person|data |
+------+--------------------------------------------------------------------------------+
|Sue |[[85,Household, 16 July 2018], [963,Household, 16 Sept 2018]] |
|Bob |[[562,Food,12 May 2018], [880,Food,01 June 2018], [380,Household, 16 June 2018]]|
+------+--------------------------------------------------------------------------------+
The schema is:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Amount: long (nullable = true)
| | |-- Budget: string (nullable = true)
| | |-- Date: string (nullable = true)
I need to convert the groupBy into a reduceByKey so that I can produce the same schema as above.
Answer 0 (score: 0)
How about this:
def flatten(l, ltypes=(tuple,)):
    # Splice nested tuples into a single flat sequence;
    # inner lists are deliberately left intact.
    ltype = type(l)
    l = list(l)
    i = 0
    while i < len(l):
        while isinstance(l[i], ltypes):
            if not l[i]:
                l.pop(i)
                i -= 1
                break
            else:
                l[i:i + 1] = l[i]
        i += 1
    return ltype(l)
def nested_change(item, func):
    # Apply func to every leaf of an arbitrarily nested list.
    if isinstance(item, list):
        return [nested_change(x, func) for x in item]
    return func(item)

def convert(*args):
    # Pack the accumulated value and the next row into a tuple.
    return args
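To see why convert works as the reduce function, here is a minimal plain-Python sketch (no Spark needed; functools.reduce stands in for reduceByKey) showing how repeated application builds a nested tuple of the per-row lists:

```python
from functools import reduce

def convert(*args):
    # Same helper as above: pack the accumulated value and the
    # next row into a tuple, nesting deeper on each reduction step.
    return args

rows = [[562, 'Food', '12 May 2018'],
        [880, 'Food', '01 June 2018'],
        [380, 'Household', '16 June 2018']]

# reduceByKey applies the function pairwise, like functools.reduce:
merged = reduce(convert, rows)
print(merged)
# (([562, 'Food', '12 May 2018'], [880, 'Food', '01 June 2018']), [380, 'Household', '16 June 2018'])
```

Note that convert is not associative, so the exact nesting shape depends on how Spark combines values across partitions; that is why the result has to be run through flatten afterwards.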
cols = df.columns  # 'cols' was not defined in the original answer; it needs to hold the column names

df_final = df.rdd.map(lambda x: (x['Person'], [x[y] for y in cols if y != 'Person'])) \
    .reduceByKey(convert) \
    .map(lambda x: (x[0], nested_change(list(flatten(x[1])), str))) \
    .toDF(['person', 'data'])
df_final.show()
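The map/reduceByKey/map chain can be checked without a cluster. The sketch below is a pure-Python stand-in (the helpers are repeated so it runs on its own) that reproduces what the pipeline does to Bob's three rows:

```python
# Pure-Python stand-in for the map/reduceByKey/map chain above (no Spark needed).
from functools import reduce

def flatten(l, ltypes=(tuple,)):
    # Splice nested tuples into a flat sequence; inner lists stay intact.
    ltype = type(l)
    l = list(l)
    i = 0
    while i < len(l):
        while isinstance(l[i], ltypes):
            if not l[i]:
                l.pop(i)
                i -= 1
                break
            else:
                l[i:i + 1] = l[i]
        i += 1
    return ltype(l)

def nested_change(item, func):
    if isinstance(item, list):
        return [nested_change(x, func) for x in item]
    return func(item)

def convert(*args):
    return args

bob_rows = [[562, 'Food', '12 May 2018'],
            [880, 'Food', '01 June 2018'],
            [380, 'Household', '16 June 2018']]

merged = reduce(convert, bob_rows)                 # what reduceByKey produces
data = nested_change(list(flatten(merged)), str)   # what the second map produces
print(data)
# [['562', 'Food', '12 May 2018'], ['880', 'Food', '01 June 2018'], ['380', 'Household', '16 June 2018']]
```

Two caveats: nested_change converts every leaf to a string, so Amount ends up as a string rather than a long as in the original schema; and a person with only a single row never passes through convert, so their value stays a flat [Amount, Budget, Date] list instead of a list of lists.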