Using groupBy in Spark

Asked: 2016-07-14 14:04:14

Tags: python apache-spark pyspark apache-spark-sql spark-dataframe

I am currently learning Spark with Python. I have a small question: in other languages such as SQL, we can simply group a table by specified columns and then perform further operations like sum, count, etc. on them. How do we do this in Spark?

My schema looks like:

    [name:"ABC", city:"New York", money:"50"]
    [name:"DEF", city:"London", money:"10"]
    [name:"ABC", city:"New York", money:"30"]
    [name:"XYZ", city:"London", money:"20"]
    [name:"XYZ", city:"London", money:"100"]
    [name:"DEF", city:"London", money:"200"]

Let's say I want to group by city and then, for each name, sum the money. Something like:

    New York ABC 80
    London DEF 210
    London XYZ 120

2 Answers:

Answer 0 (score: 2)

You can use SQL:

    >>> sc.parallelize([
    ... {"name": "ABC", "city": "New York", "money":"50"},
    ... {"name": "DEF", "city": "London",   "money":"10"},
    ... {"name": "ABC", "city": "New York", "money":"30"},
    ... {"name": "XYZ", "city": "London",   "money":"20"},
    ... {"name": "XYZ", "city": "London",   "money":"100"},
    ... {"name": "DEF", "city": "London",   "money":"200"},
    ... ]).toDF().registerTempTable("df")

    >>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
    ... FROM df GROUP BY name, city""")
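
Since sqlContext.sql returns a DataFrame, you can display the aggregated result by calling show() on it, for example:

    >>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
    ... FROM df GROUP BY name, city""").show()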

Answer 1 (score: 2)

You can also do this in a Pythonic way (or the SQL version @LostInOverflow posted):

    grouped = df.groupby('city', 'name').sum('money')

Your money column appears to be a string, so you will need to cast it to an int first (or load it that way in the first place):

    df = df.withColumn('money', df['money'].cast('int'))

Keep in mind that DataFrames are immutable, so both of these require you to assign the result to an object (even if it is just back to df again), and then use show if you want to see the results.
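
Putting those pieces together, a minimal sketch (assuming df has already been created, as in the edit below) might look like:

    # cast money from string to int, then group and aggregate
    df = df.withColumn('money', df['money'].cast('int'))
    grouped = df.groupby('city', 'name').sum('money')
    # show() prints the result, one sum(money) value per (city, name) pair
    grouped.show()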

Edit: I should add that you first need to create a DataFrame. For your simple data it is almost identical to the SQL version posted, but you assign it to a DataFrame object instead of registering it as a table:

df = sc.parallelize([
    {"name": "ABC", "city": "New York", "money":"50"},
    {"name": "DEF", "city": "London",   "money":"10"},
    {"name": "ABC", "city": "New York", "money":"30"},
    {"name": "XYZ", "city": "London",   "money":"20"},
    {"name": "XYZ", "city": "London",   "money":"100"},
    {"name": "DEF", "city": "London",   "money":"200"},
    ]).toDF()
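
As an alternative to casting the column afterwards (the "load it that way" option mentioned above), a minimal sketch that builds the same DataFrame with money already supplied as integers:

    df = sc.parallelize([
        {"name": "ABC", "city": "New York", "money": 50},
        {"name": "DEF", "city": "London",   "money": 10},
        {"name": "ABC", "city": "New York", "money": 30},
        {"name": "XYZ", "city": "London",   "money": 20},
        {"name": "XYZ", "city": "London",   "money": 100},
        {"name": "DEF", "city": "London",   "money": 200},
        ]).toDF()
    # money is inferred as a numeric column, so no withColumn cast is needed
    df.groupby('city', 'name').sum('money').show()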