I want to find the maximum value in a list. How can I do this in PySpark?
df = spark.createDataFrame([(1, [4,1]), (2, [4,5]), (3, [4,0])], ["A", "B"])
df.show()
+---+------+
| A| B|
+---+------+
| 1|[4, 1]|
| 2|[4, 5]|
| 3|[4, 0]|
+---+------+
In the example above, how can I find, for each row, the maximum value of the list in column B? The expected result would be 4, 5 and 4 respectively.
Answer 0 (score: 1)
You can use the aggregate function.
df = spark.createDataFrame([(1, [4, 1, 4, 54, 4, 2, 2, 7, 14, 23, 74, 53]), (2, [4, 5, 11, 3, 45, 34, 2, 3, 4]), (3, [4, 0, 32, 23, 23, 5, 23, 2, 37, 8, 6, 54, 54])], ["A", "B"])
from pyspark.sql.functions import expr

# Fold over the array, keeping the larger of the accumulator and each element.
df.withColumn('Max', expr('aggregate(B, 0L, (a, b) -> if(a < b, b, a))')).show(3, False)
+---+----------------------------------------------+---+
|A |B |Max|
+---+----------------------------------------------+---+
|1 |[4, 1, 4, 54, 4, 2, 2, 7, 14, 23, 74, 53] |74 |
|2 |[4, 5, 11, 3, 45, 34, 2, 3, 4] |45 |
|3 |[4, 0, 32, 23, 23, 5, 23, 2, 37, 8, 6, 54, 54]|54 |
+---+----------------------------------------------+---+
Note that 0L is a long literal; its type should match the element type of the array.
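As a minimal sketch of that point (my own assumption, not part of the original answer): if column B were declared as an array of integers rather than the longs inferred above, the starting value would be a plain 0 instead of 0L.

from pyspark.sql.functions import expr
from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType

# Hypothetical variant: B is explicitly ArrayType(IntegerType()), so the
# aggregate starting value must be an int literal (0), not a long (0L).
schema = StructType([
    StructField("A", IntegerType()),
    StructField("B", ArrayType(IntegerType())),
])
df_int = spark.createDataFrame([(1, [4, 1]), (2, [4, 5]), (3, [4, 0])], schema)
df_int.withColumn("Max", expr("aggregate(B, 0, (a, b) -> if(a < b, b, a))")).show()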
Answer 1 (score: 0)
This seems to work. Not sure exactly why, but :-)
import pyspark.sql.functions as F
from pyspark.sql.types import LongType

# Wrap Python's built-in max in a UDF. The return type must match the array's
# element type (LongType here, since Python ints are inferred as longs).
def my_max(s):
    return max(s)

my_max2 = F.udf(my_max, LongType())
df.withColumn("mymax", my_max2("B")).show()
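For what it's worth (an addition of mine, not part of either answer): on Spark 2.4+ the built-in array_max function avoids the UDF entirely.

from pyspark.sql.functions import array_max

# array_max returns the largest element of an array column directly.
df.withColumn("mymax", array_max("B")).show()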