How to use reduceByKey in PySpark with custom grouping of rows?

Asked: 2019-05-22 09:31:28

Tags: apache-spark pyspark apache-spark-sql rdd

I have a dataframe that looks like this:

items_df
======================================================
| customer   item_type    brand    price    quantity |  
|====================================================|
|  1         bread        reems     20         10    |  
|  2         butter       spencers  10         21    |  
|  3         jam          niles     10         22    |
|  1         bread        marks     16         18    |
|  1         butter       jims      19         12    |
|  1         jam          jills     16         6     |
|  2         bread        marks     16         18    |
======================================================

I created an rdd from it that converts each row into a dict:

rdd = items_df.rdd.map(lambda row: row.asDict())

The result looks like this:

[
   { "customer": 1, "item_type": "bread", "brand": "reems", "price": 20, "quantity": 10 },
   { "customer": 2, "item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21 },
   { "customer": 3, "item_type": "jam", "brand": "niles", "price": 10, "quantity": 22 },
   { "customer": 1, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 },
   { "customer": 1, "item_type": "butter", "brand": "jims", "price": 19, "quantity": 12 },
   { "customer": 1, "item_type": "jam", "brand": "jills", "price": 16, "quantity": 6 },
   { "customer": 2, "item_type": "bread", "brand": "marks", "price": 16, "quantity": 18 }
]

I want to first group the rows above by customer. Then I want to introduce custom new keys "breads", "butters", "jams" and group all of that customer's rows under them. So my rdd shrinks from 7 rows to 3 rows.

The output would look like this:

[
    { 
        "customer": 1, 
        "breads": [
            {"item_type": "bread", "brand": "reems", "price": 20, "quantity": 10},
            {"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18},
        ],
        "butters": [
            {"item_type": "butter", "brand": "jims", "price": 19, "quantity": 12}
        ],
        "jams": [
            {"item_type": "jam", "brand": "jills", "price": 16, "quantity": 6}
        ]
    },
    {
        "customer": 2,
        "breads": [
            {"item_type": "bread", "brand": "marks", "price": 16, "quantity": 18}
        ],
        "butters": [
            {"item_type": "butter", "brand": "spencers", "price": 10, "quantity": 21}
        ],
        "jams": []
    },
    {
        "customer": 3,
        "breads": [],
        "butters": [],
        "jams": [
            {"item_type": "jam", "brand": "niles", "price": 10, "quantity": 22}
        ]
    }
]

Would anyone know how to achieve this in PySpark? I would like to know whether there is a solution using reduceByKey() or something similar. I want to avoid groupByKey() if possible.

2 Answers:

Answer 0 (score: 1)

First, add an item_types column (item_type with an "s" suffix) to pivot the dataframe on:

from pyspark.sql import functions as F  # needed for F.concat, F.lit, F.col, etc.

items_df = items_df.withColumn('item_types', F.concat(F.col('item_type'), F.lit('s')))
items_df.show()

+--------+---------+--------+-----+--------+----------+
|customer|item_type|   brand|price|quantity|item_types|
+--------+---------+--------+-----+--------+----------+
|       1|    bread|   reems|   20|      10|    breads|
|       2|   butter|spencers|   10|      21|   butters|
|       3|      jam|   niles|   10|      22|      jams|
|       1|    bread|   marks|   16|      18|    breads|
|       1|   butter|    jims|   19|      12|   butters|
|       1|      jam|   jills|   16|       6|      jams|
|       2|    bread|   marks|   16|      18|    breads|
+--------+---------+--------+-----+--------+----------+

Then you can pivot the dataframe on item_types, grouping by customer, and aggregate the remaining columns with F.collect_list() at the same time.

Finally, to turn the nested Row objects in the result back into dicts, convert each row with asDict(recursive=True).

items_df = items_df.groupby(['customer']).pivot("item_types").agg(
    F.collect_list(F.struct(F.col("item_type"),F.col("brand"), F.col("price"),F.col("quantity")))
).sort('customer')
items_df.show()

+--------+--------------------+--------------------+--------------------+
|customer|              breads|             butters|                jams|
+--------+--------------------+--------------------+--------------------+
|       1|[[bread, reems, 2...|[[butter, jims, 1...|[[jam, jills, 16,...|
|       2|[[bread, marks, 1...|[[butter, spencer...|                  []|
|       3|                  []|                  []|[[jam, niles, 10,...|
+--------+--------------------+--------------------+--------------------+
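A minimal sketch of that final conversion (asDict(recursive=True) is standard PySpark, but this exact snippet is an illustration rather than part of the original answer):

# Turn each pivoted Row, including the nested struct lists, into a plain dict.
grouped = items_df.rdd.map(lambda row: row.asDict(recursive=True)).collect()
# grouped is a list of dicts shaped like:
# {"customer": 1, "breads": [{"item_type": "bread", ...}, ...], "butters": [...], "jams": [...]}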

Answer 1 (score: 0)

Here is another approach I used, with reduceByKey() on the rdd. Given the dataframe items_df, first convert it to an rdd:

rdd = items_df.rdd.map(lambda row: row.asDict())

Convert each row into a tuple (customer, [row_obj]), where row_obj is wrapped in a list:

rdd = rdd.map(lambda row: ( row["customer"], [row] ) )

Use reduceByKey to group by customer, concatenating the lists that belong to a given customer:

rdd = rdd.reduceByKey(lambda x,y: x+y)
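As a side note that is not part of the original answer: the same per-customer grouping can also be expressed with aggregateByKey, which starts from an empty list per key instead of wrapping every row in a single-element list first. A rough sketch, using a hypothetical pair_rdd of (customer, row_dict) pairs:

# Sketch only, not from the original answer: pair_rdd holds (customer, row_dict)
# pairs that have NOT been wrapped in single-element lists.
pair_rdd = items_df.rdd.map(lambda row: (row["customer"], row.asDict()))
grouped_rdd = pair_rdd.aggregateByKey(
    [],                               # zero value: an empty list per customer
    lambda acc, row: acc + [row],     # add one row dict to the partition-local list
    lambda a, b: a + b,               # merge lists from different partitions
)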

Convert the tuples back into dicts, where the key is the customer and the value is the list of all its associated rows:

rdd = rdd.map(lambda tup: { tup[0]: tup[1] } )

Since each customer's data is now in a single element, we can use a custom function to split it into breads, butters and jams:

def organize_items_in_customer(row):
    cust_id = list(row.keys())[0]
    items = row[cust_id]
    new_cust_obj = { "customer": cust_id, "breads": [], "butters": [], "jams": [] }
    plurals = { "bread":"breads", "butter":"butters", "jam":"jams" }
    for item in items:
        item_type = item["item_type"]
        key = plurals[item_type]
        new_cust_obj[key].append(item)
    return new_cust_obj
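A small caveat that is my own note rather than part of the original answer: plurals[item_type] raises a KeyError for any item type other than bread, butter, or jam. If other types can appear, the lookup could fall back to a generated plural, for example:

# Hypothetical defensive lookup (assumption, not in the original answer):
# unmapped item types get a generated "<item_type>s" bucket instead of a KeyError.
plurals = {"bread": "breads", "butter": "butters", "jam": "jams"}
item_type = "milk"                             # example of an unmapped type
key = plurals.get(item_type, item_type + "s")  # -> "milks"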

Call the function above to transform the rdd:

rdd = rdd.map(organize_items_in_customer)
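For completeness, a small usage sketch that is not in the original answer: it collects the grouped records and prints per-customer item counts; the expected numbers are simply read off the sample rows in the question.

# rdd here is the result of the map(organize_items_in_customer) step above.
result = rdd.collect()
for cust in sorted(result, key=lambda d: d["customer"]):
    print(cust["customer"], len(cust["breads"]), len(cust["butters"]), len(cust["jams"]))
# Expected for the sample data: 1 2 1 1 / 2 1 1 0 / 3 0 0 1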