Question

I have columns X (string), Y (string), and Z (float).

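I want to

aggregate on X
take the maximum of column Z
report all the values for columns X, Y, and Z

If multiple values of column Y correspond to the maximum of column Z, then take the maximum of those values in column Y.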

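For example, my table (table1) looks like this:

col X col Y col Z
A     1     5
A     2     10
A     3     10
B     5     15

resulting in:

col X col Y col Z
A     3     10
B     5     15

I would know how to do this in SQL. However, how do I do it when 1) column Z is a float, and 2) I am using pyspark sql?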

Answer 1

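Both of the following solutions are in Scala, but honestly I could not resist posting them to promote my beloved window aggregate functions. Sorry.

The only question is which structured query is more performant/effective?

Window aggregate function: rank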

val df = Seq(
  ("A",1,5),
  ("A",2,10),
  ("A",3,10),
  ("B",5,15)
).toDF("x", "y", "z")

scala> df.show
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  A|  1|  5|
|  A|  2| 10|
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

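// describe window specification
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, rank} // rank and first come from here
// order by z first, then y, in a single orderBy call:
// chaining a second orderBy replaces the first ordering instead of extending it
val byX = Window.partitionBy("x").orderBy($"z".desc, $"y".desc)

// use rank to pick the best row for each x
scala> df.withColumn("rank", rank over byX)
  .where($"rank" === 1) // <-- take the first row
  .select("x", "y", "z")
  .orderBy("x")
  .show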
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

Window aggregate function: first and dropDuplicates

I have always been thinking about alternatives to the rank function, and first usually sprang to mind.

// use first and dropDuplicates
scala> df.
  withColumn("y", first("y") over byX).
  withColumn("z", first("z") over byX).
  dropDuplicates.
  orderBy("x").
  show
+---+---+---+
|  x|  y|  z|
+---+---+---+
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

Answer 2

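You may consider using Window functions. My approach is to create a window that first partitions the dataframe by X and then orders the rows by Z and Y, both descending, so that the row with the largest Z (and, among ties, the largest Y) comes first in each partition.

We can then simply select rank == 1 to get the rows we are interested in.
Or we can use first and drop_duplicates to accomplish the same task.

PS. Thanks to Jacek Laskowski for the comments and the Scala solution that led to this solution.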
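To make the rank == 1 selection concrete without a Spark session, here is a plain-Python sketch of what rank computes over such a window. It is only an emulation under stated assumptions: `rank_within_x` is a made-up helper name, Z is assumed to be a float, and rows are ordered by Z and then Y, both descending.

```python
from collections import defaultdict

rows = [("A", 1, 5.0), ("A", 2, 10.0), ("A", 3, 10.0), ("B", 5, 15.0)]

def rank_within_x(rows):
    """Emulate rank() over (partition by X, order by Z desc, Y desc)."""
    # Hypothetical helper for illustration; not part of any Spark API.
    partitions = defaultdict(list)
    for x, y, z in rows:
        partitions[x].append((x, y, z))

    ranked = []
    for x in sorted(partitions):
        part = sorted(partitions[x], key=lambda r: (r[2], r[1]), reverse=True)
        for row in part:
            # rank = 1 + number of rows that sort strictly ahead of this one
            ahead = sum((r[2], r[1]) > (row[2], row[1]) for r in part)
            ranked.append(row + (ahead + 1,))
    return ranked

ranked = rank_within_x(rows)
top = [r[:3] for r in ranked if r[3] == 1]
print(top)  # [('A', 3, 10.0), ('B', 5, 15.0)]
```

One difference between the two approaches shows up here: rows that tie on both Z and Y all receive rank 1, so filtering on rank == 1 can return exact duplicate rows, whereas first followed by drop_duplicates collapses them into one.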

Create a toy example dataset

from pyspark.sql.window import Window
import pyspark.sql.functions as func

data=[('A',1,5),
      ('A',2,10),
      ('A',3,10),
      ('B',5,15)]
df = spark.createDataFrame(data,schema=['X','Y','Z'])

Window aggregate function: rank

Apply the window function using rank

# order by Z first, then Y: rank 1 is the max Z, with the max Y as tie-breaker
w = Window.partitionBy(df['X']).orderBy([func.col('Z').desc(), func.col('Y').desc()])
df_max = df.select('X', 'Y', 'Z', func.rank().over(w).alias("rank"))
df_final = df_max.where(func.col('rank') == 1).select('X', 'Y', 'Z').orderBy('X')
df_final.show()

Output

+---+---+---+
|  X|  Y|  Z|
+---+---+---+
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

Window aggregate function: first and drop_duplicates

Using first and drop_duplicates as follows also accomplishes this task

df_final = df.select('X', func.first('Y').over(w).alias('Y'), func.first('Z').over(w).alias('Z'))\
    .drop_duplicates()\
    .orderBy('X')
df_final.show()

Output

+---+---+---+
|  X|  Y|  Z|
+---+---+---+
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

Answer 3

Let's create a dataframe from your sample data:

data=[('A',1,5),
      ('A',2,10),
      ('A',3,10),
      ('B',5,15)]

df = spark.createDataFrame(data,schema=['X','Y','Z'])
df.show()

Output:

+---+---+---+
|  X|  Y|  Z|
+---+---+---+
|  A|  1|  5|
|  A|  2| 10|
|  A|  3| 10|
|  B|  5| 15|
+---+---+---+

from pyspark.sql.functions import col

# create an intermediate dataframe that finds the max of Z for each X
df1 = df.groupby('X').max('Z').toDF('X2','max_Z')

# create a 2nd intermediate dataframe that finds the max of Y where Z = max of Z
df2 = df.join(df1,df.X==df1.X2)\
        .where(col('Z')==col('max_Z'))\
        .groupBy('X')\
        .max('Y').toDF('X','max_Y')

# join the two above to form the final result
result = df1.join(df2,df1.X2==df2.X)\
            .select('X','max_Y','max_Z')\
            .orderBy('X')

result.show()
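The three steps above can be traced in plain Python. This is a sketch rather than Spark code: the dictionaries `max_z` and `max_y` are stand-ins for df1 and df2, and Z is assumed to be a float with no NaN values. It also suggests why comparing Z with max_Z using == is safe for floats: the maximum is itself one of the stored Z values, so the equality is exact and no tolerance is needed.

```python
rows = [("A", 1, 5.0), ("A", 2, 10.0), ("A", 3, 10.0), ("B", 5, 15.0)]

# Step 1: max of Z per X (the role of df1)
max_z = {}
for x, _, z in rows:
    max_z[x] = max(z, max_z.get(x, z))

# Step 2: max of Y among rows whose Z equals that maximum (the role of df2)
max_y = {}
for x, y, z in rows:
    if z == max_z[x]:  # exact match: max_z[x] is one of the Z values themselves
        max_y[x] = max(y, max_y.get(x, y))

# Step 3: combine the two per-X results (the role of the final join)
result = [(x, max_y[x], max_z[x]) for x in sorted(max_z)]
print(result)  # [('A', 3, 10.0), ('B', 5, 15.0)]
```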

How to aggregate on one column and take the maximum of others in pyspark?
