I have the following DataFrame:
>>> my_df.show(3)
+-------+-------+------+-----+-------+
|user_id|address|  type|count|country|
+-------+-------+------+-----+-------+
| ABC123|yyy,USA|animal|    2|    USA|
| ABC123|xxx,USA|animal|    3|    USA|
| qwerty|55A,AUS| human|    3|    AUS|
| ABC123|zzz,RSA|animal|    4|    RSA|
+-------+-------+------+-----+-------+
How can I aggregate this DataFrame to get the following result?
>>> new_df.show(3)
+-------+-------+------+-----+-------+
|user_id|address|  type|count|country|
+-------+-------+------+-----+-------+
| qwerty|55A,AUS| human|    3|    AUS|
| ABC123|xxx,USA|animal|    5|    USA|
+-------+-------+------+-----+-------+
For a given user_id: get the country with the highest total count, and for that country, get the address with the highest count.
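I guess I have to split my_df into 2 different DataFrames and get the country and the address separately, but I don't know the exact syntax. Any help is appreciated. Thanks.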
Answer 0 (score: 1)
I mean something like this:
>>> import pandas as pd
>>> from pyspark.sql.functions import *
>>> from pyspark.sql.window import *
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('abc').getOrCreate()
>>> data = {"user_id": ["ABC123", "ABC123", "qwerty", "ABC123"], "address": ["yyy,USA", "xxx,USA", "55A,AUS", "zzz,RSA"], "type": ["animal", "animal", "human", "animal"], "count": [2,3,3,4], "country": ["USA", "USA", "AUS", "RSA"]}
>>> df = pd.DataFrame(data=data)
>>> df_pyspark = spark.createDataFrame(df)
>>> w = Window().partitionBy("user_id", "country").orderBy((col("count").desc()))
>>> w2 = Window().partitionBy("user_id").orderBy(col("sum_country").desc())
>>> (df_pyspark
...     .select("user_id", "address", "type", "count", "country",
...             sum("count").over(w).alias("sum_country"))
...     .select("user_id",
...             first("country").over(w2).alias("top_country"),
...             first("address").over(w).alias("top_address"),
...             "country")
...     .where(col("top_country") == col("country"))
...     .distinct()
...     .show())
+-------+-----------+-----------+-------+
|user_id|top_country|top_address|country|
+-------+-----------+-----------+-------+
| qwerty| AUS| 55A,AUS| AUS|
| ABC123| USA| xxx,USA| USA|
+-------+-----------+-----------+-------+
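You can add type, count, and so on, depending on which logic you want to use: you can either do the same as for top_address (i.e. use the first function), or you can use groupBy and agg.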