I have the following code with an aggregation function:
new_df = my_df.groupBy('id').agg({"id": "count", "money":"max"})
The new columns I get are COUNT(id) and MAX(money). Can I specify the column names myself instead of using the default ones? E.g. I want them to be called my_count_id and my_max_money. How do I do that? Thanks!
Answer 0 (score: 2)
Use columns instead of a dictionary:
>>> from pyspark.sql.functions import *
>>> my_df.groupBy('id').agg(count("id").alias("some name"), max("money").alias("some other name"))
Answer 1 (score: 1)
Something like this, perhaps:
new_df = my_df.groupBy('id') \
    .agg({"id": "count", "money": "max"}) \
    .withColumnRenamed("COUNT(id)", "my_count_id") \
    .withColumnRenamed("MAX(money)", "my_max_money")
Or:
import pyspark.sql.functions as func
new_df = my_df.groupBy('id') \
    .agg(func.count("id").alias("my_count_id"),
         func.max("money").alias("my_max_money"))