I'm currently learning Spark in Python. I have a small question: in other languages such as SQL, we can simply group a table by specified columns and then perform further operations on them, such as sum, count, etc. How do we do this in Spark?
My data looks like this:
[name:"ABC", city:"New York", money:"50"]
[name:"DEF", city:"London", money:"10"]
[name:"ABC", city:"New York", money:"30"]
[name:"XYZ", city:"London", money:"20"]
[name:"XYZ", city:"London", money:"100"]
[name:"DEF", city:"London", money:"200"]
Let's say I want to group by city and then sum the money for each name. Something like:
New York ABC 80
London DEF 210
London XYZ 120
Answer 0 (score: 2)
You can use SQL:
>>> sc.parallelize([
... {"name": "ABC", "city": "New York", "money":"50"},
... {"name": "DEF", "city": "London", "money":"10"},
... {"name": "ABC", "city": "New York", "money":"30"},
... {"name": "XYZ", "city": "London", "money":"20"},
... {"name": "XYZ", "city": "London", "money":"100"},
... {"name": "DEF", "city": "London", "money":"200"},
... ]).toDF().registerTempTable("df")
>>> sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
... FROM df GROUP BY name, city""")
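Note that sql only defines the result; to actually print it you would call show (or collect) on the DataFrame it returns, for example (a minimal sketch using the same sqlContext as above):
>>> totals = sqlContext.sql("""SELECT name, city, sum(cast(money as bigint)) AS total
... FROM df GROUP BY name, city""")
>>> totals.show()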
Answer 1 (score: 2)
You can also do this in a Pythonic way (or the SQL version posted by @LostInOverflow):
grouped = df.groupby('city', 'name').sum('money')
Your money column appears to be a string, so you need to cast it to an int first (or load it that way):
df = df.withColumn('money', df['money'].cast('int'))
Keep in mind that DataFrames are immutable, so both of these require you to assign the result to an object (even if it's just back to df again), and then use show if you want to see the results.
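For example, a minimal sketch of that pattern, combining the cast and the grouping shown above:
# cast money to int, group, aggregate, then print the result
df = df.withColumn('money', df['money'].cast('int'))
grouped = df.groupby('city', 'name').sum('money')
grouped.show()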
Edit: I should add that you need to create a DataFrame first. For your simple data, it's almost identical to the posted SQL version, but you assign it to a DataFrame object rather than registering it as a table:
df = sc.parallelize([
{"name": "ABC", "city": "New York", "money":"50"},
{"name": "DEF", "city": "London", "money":"10"},
{"name": "ABC", "city": "New York", "money":"30"},
{"name": "XYZ", "city": "London", "money":"20"},
{"name": "XYZ", "city": "London", "money":"100"},
{"name": "DEF", "city": "London", "money":"200"},
]).toDF()
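If you prefer a friendlier column name than the default sum(money), one option (a sketch, not part of the original answer) is to use agg with pyspark.sql.functions.sum and alias:
from pyspark.sql import functions as F

# cast, then aggregate with an explicit alias for the summed column
df = df.withColumn('money', df['money'].cast('int'))
totals = df.groupby('city', 'name').agg(F.sum('money').alias('total'))
totals.show()  # expect New York/ABC/80, London/DEF/210, London/XYZ/120; row order not guaranteed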