Group and filter a PySpark dataframe

Asked: 2019-10-04 17:20:13

Tags: python dataframe pyspark

I have a PySpark dataframe with 3 columns. Some rows are identical in 2 of the columns but differ in the third; see the example below.

----------------------------------------
first_name | last_name | requests_ID    |
----------------------------------------
Joe        | Smith     |[2,3]           |
---------------------------------------- 
Joe        | Smith     |[2,3,5,6]       |
---------------------------------------- 
Jim        | Bush      |[9,7]           |
---------------------------------------- 
Jim        | Bush      |[21]            |
---------------------------------------- 
Sarah      | Wood      |[2,3]           |
----------------------------------------   

I want to group the rows by the {first_name, last_name} columns and keep only the row with the largest {requests_ID} array. So the result should be:

----------------------------------------
first_name | last_name | requests_ID    |
----------------------------------------
Joe        | Smith     |[2,3,5,6]       |
---------------------------------------- 
Jim        | Bush      |[9,7]           |
---------------------------------------- 
Sarah      | Wood      |[2,3]           |
---------------------------------------- 

I tried something like the following, but it gives me a nested array of both rows in each group instead of the longest one.

gr_df = filtered_df.groupBy("first_name", "last_name").agg(F.collect_set("requests_ID").alias("requests_ID")) 

This is the result I get:

----------------------------------------
first_name | last_name | requests_ID    |
----------------------------------------
Joe        | Smith     |[[2,3],[2,3,5,6]]|
---------------------------------------- 
Jim        | Bush      |[[9,7],[21]]    |
---------------------------------------- 
Sarah      | Wood      |[2,3]           |
---------------------------------------- 

2 Answers:

Answer 0 (score: 1)

To proceed with the df as you currently have it,

----------------------------------------
first_name | last_name | requests_ID    |
----------------------------------------
Joe        | Smith     |[[2,3],[2,3,5,6]]|
---------------------------------------- 
Jim        | Bush      |[[9,7],[21]]    |
---------------------------------------- 
Sarah      | Wood      |[2,3]           |
---------------------------------------- 

try this:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

def myfunc(x):
  # x is a list of nested arrays; record the length of each one
  temp = []
  for arr in x:
    temp.append(len(arr))

  # index of the longest nested array
  max_ind = temp.index(max(temp))

  return x[max_ind]

udf_extract = F.udf(myfunc, ArrayType(IntegerType()))

df = df.withColumn('new_requests_ID', udf_extract('requests_ID'))

#df.show()
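A minimal sketch of how this could be wired into the pipeline from the question. It assumes filtered_df and the groupBy/collect_set step shown in the question, and the variable names (gr_df, result) are only illustrative:

import pyspark.sql.functions as F

# Group first, as in the question, producing a nested array per (first_name, last_name)
gr_df = filtered_df.groupBy("first_name", "last_name") \
                   .agg(F.collect_set("requests_ID").alias("requests_ID"))

# Pick the longest nested array with the UDF defined above
result = (gr_df
          .withColumn("new_requests_ID", udf_extract("requests_ID"))
          .drop("requests_ID")
          .withColumnRenamed("new_requests_ID", "requests_ID"))

result.show(truncate=False)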

Or, without a separate udf variable declaration,

import pyspark.sql.functions as F

@F.udf
def myfunc(x):
  # x is a list of nested arrays; record the length of each one
  temp = []
  for arr in x:
    temp.append(len(arr))

  # index of the longest nested array
  max_ind = temp.index(max(temp))

  return x[max_ind]

df = df.withColumn('new_requests_ID', myfunc('requests_ID'))

#df.show()
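One caveat worth checking against your Spark version: a bare @F.udf defaults to a StringType return, so the new column would come back as a string rather than an array. A small sketch of the same UDF with the array return type declared in the decorator:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, ArrayType

# Same UDF, but declaring the array return type in the decorator
@F.udf(returnType=ArrayType(IntegerType()))
def myfunc(x):
  lengths = [len(arr) for arr in x]      # length of each nested array
  return x[lengths.index(max(lengths))]  # keep the longest one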

Answer 1 (score: 1)

You can use size to determine the length of the array column and a window, as shown below:

Import and create a sample DataFrame.

import pyspark.sql.functions as f
from pyspark.sql.window import Window

df = spark.createDataFrame([('Joe', 'Smith', [2, 3]),
                            ('Joe', 'Smith', [2, 3, 5, 6]),
                            ('Jim', 'Bush', [9, 7]),
                            ('Jim', 'Bush', [21]),
                            ('Sarah', 'Wood', [2, 3])],
                           ('first_name', 'last_name', 'requests_ID'))

Define a window to assign a row number within each (first_name, last_name) group, ordered by the length of requests_ID in descending order.

Here, f.size("requests_ID") gives the length of the requests_ID array, and desc() sorts the rows by that length in descending order.

w_spec = Window().partitionBy("first_name", "last_name").orderBy(f.size("requests_ID").desc())

Apply the window function and take the first row of each group.

df.withColumn("rn", f.row_number().over(w_spec)).where("rn ==1").drop("rn").show()
+----------+---------+------------+
|first_name|last_name| requests_ID|
+----------+---------+------------+
|       Jim|     Bush|      [9, 7]|
|     Sarah|     Wood|      [2, 3]|
|       Joe|    Smith|[2, 3, 5, 6]|
+----------+---------+------------+
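Note that row_number() keeps exactly one row per group, so if two rows tie for the longest array the choice between them is arbitrary. A small variant (a sketch, reusing the w_spec defined above) that keeps every tied row would use dense_rank() instead:

# Keep all rows that tie for the longest requests_ID in their group
df.withColumn("rk", f.dense_rank().over(w_spec)).where("rk == 1").drop("rk").show()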