Question

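Given a Spark DataFrame df, I want to find the maximum value in a certain numeric column 'values' and obtain the row(s) where that value is reached. I can of course do this: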

# it doesn't matter if I use scala or python, 
# since I hope I get this done with DataFrame API
import pyspark.sql.functions as F
max_value = df.select(F.max('values')).collect()[0][0]
df.filter(df.values == max_value).show()

but this is inefficient, since it requires two passes through df.

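pandas.Series / DataFrame and numpy.array have argmax / idxmax methods that do this efficiently (in one pass). So does standard Python (the built-in function max accepts a key parameter, so it can be used to find the index of the highest value).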
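
For reference, the plain-Python one-pass idiom mentioned above might look like the sketch below. The list `values` is made-up sample data, not taken from the question.

```python
# Made-up sample data for illustration.
values = [3, 17, 5, 17, 2]

# max() walks the indices once; key= makes it compare the values
# stored at those indices rather than the indices themselves.
argmax = max(range(len(values)), key=values.__getitem__)

# On ties, max() returns the first maximal element it encounters.
print(argmax, values[argmax])  # prints: 1 17
```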

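What is the right approach in Spark? Note that I don't mind whether I get all the rows where the maximum value is reached, or just some arbitrary (non-empty!) subset of those rows.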

Answer 1

If the schema is Orderable (it contains only atomics / arrays of atomics / recursively orderable structs), you can use a simple aggregation:

Python:

df.select(F.max(
    F.struct("values", *(x for x in df.columns if x != "values"))
)).first()

Scala:

df.select(max(struct(
    $"values" +: df.columns.collect {case x if x!= "values" => col(x)}: _*
))).first
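
The aggregation above depends on structs being compared field by field, left to right, which is why "values" is placed first. As a rough analogy only (plain Python tuples, not Spark), the same ordering idea can be sketched like this. The sample rows and the (values, name) layout are assumptions made for illustration.

```python
# Hypothetical rows as (values, name) tuples, mirroring
# struct("values", <other columns>) with "values" in first position.
rows = [(100, "a"), (1000, "c"), (120, "b"), (1000, "d")]

# Tuples compare lexicographically: the first field decides,
# and later fields only break ties.
best = max(rows)

print(best)  # prints: (1000, 'd')
```

Note that with this ordering the tie between the two 1000 rows is broken by the remaining fields, which is fine here because the question accepts any row that reaches the maximum.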

Otherwise you can reduce over the Dataset (Scala only), but this requires additional deserialization:

type T = ???

df.reduce((a, b) => if (a.getAs[T]("values") > b.getAs[T]("values")) a else b)
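
To make the pairwise comparison concrete, here is a minimal plain-Python sketch of the same reduce logic using functools.reduce. It only illustrates the comparison function and is not Spark code. The dict-based records and the helper name `keep_larger` are hypothetical.

```python
from functools import reduce

# Hypothetical records standing in for Dataset rows.
rows = [
    {"name": "a", "values": 100},
    {"name": "c", "values": 1000},
    {"name": "d", "values": 1000},
]

def keep_larger(a, b):
    # Same comparison as the Scala lambda: keep a only if it is
    # strictly larger, otherwise keep b.
    return a if a["values"] > b["values"] else b

best = reduce(keep_larger, rows)

print(best["name"], best["values"])  # prints: d 1000
```

In this sequential sketch the strict comparison means the later of two tied rows wins. In Spark the order in which partial results are combined should not be relied on, so any of the tied rows may come back.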

You can also use orderBy with limit(1) / take(1):

Scala:

df.orderBy(desc("values")).limit(1)
// or
df.orderBy(desc("values")).take(1)

Python:

df.orderBy(F.desc('values')).limit(1)
# or
df.orderBy(F.desc("values")).take(1)

Answer 2

Maybe this is an incomplete answer, but you can use the DataFrame's underlying RDD, apply the max method, and get the maximum record using a chosen key:

a = sc.parallelize([
    ("a", 1, 100),
    ("b", 2, 120),
    ("c", 10, 1000),
    ("d", 14, 1000)
  ]).toDF(["name", "id", "salary"])

a.rdd.max(key=lambda x: x["salary"]) # Row(name=u'c', id=10, salary=1000)

argmax in Spark DataFrames: how to retrieve the row with the maximum value
