Question

I am using pyspark 2.3.1 and would like to filter array elements with an expression rather than with a udf:

>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2|           col3|
+----+----+---------------+
|   1|   A|   [1, 2, 3, 4]|
|   2|   B|[1, 2, 3, 4, 5]|
+----+----+---------------+

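The expression shown below is wrong. I would like to know how to tell Spark to remove any values smaller than 3 from the array in col3. I want something like: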

>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
>>> filtered.show()
+----+----+---------+
|col1|col2|   newcol|
+----+----+---------+
|   1|   A|   [3, 4]|
|   2|   B|[3, 4, 5]|
+----+----+---------+

I already have a udf solution, but it is very slow (> 1 billion data rows):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

largerThan = F.udf(lambda row, max: [x for x in row if x >= max], ArrayType(IntegerType()))
df = df.withColumn('newcol', F.size(largerThan(df.queries, F.lit(3))))

Any help is welcome. Thanks a lot in advance.

Answer 1

Spark < 2.4

There is no* reasonable replacement for a udf in PySpark.
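For reference, a udf along the lines of the one in the question could be written as follows. This is only a minimal sketch, assuming the `col3` column from the example DataFrame. The names `keep_at_least` and `with_filtered_udf` are hypothetical, and pyspark is imported inside the function only so that the plain-Python helper can be tried without Spark installed:

```python
def keep_at_least(values, threshold):
    """Return the elements of `values` that are >= threshold (None stays None)."""
    if values is None:
        return None
    return [x for x in values if x >= threshold]


def with_filtered_udf(df, threshold=3):
    """Add `newcol` holding the elements of `col3` that are >= threshold."""
    # Imported here so the helper above can be used without pyspark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import ArrayType, LongType

    # Python ints passed to createDataFrame are inferred as bigint, hence LongType.
    filter_udf = F.udf(lambda values: keep_at_least(values, threshold),
                       ArrayType(LongType()))
    return df.withColumn("newcol", filter_udf(F.col("col3")))
```

Calling `with_filtered_udf(df)` on the DataFrame from the question should add the filtered array as `newcol`. Every row still passes through the Python interpreter, so the performance concern raised in the question remains.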

Spark >= 2.4

Your code:

expr("filter(col3, x -> x >= 3)")

can be used as is.
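Put together, the Spark >= 2.4 approach might look like the sketch below. It assumes the `col3` column from the question. The helper names `build_filter_expr` and `with_filtered_expr` are hypothetical, and the pyspark import sits inside the function only so that the expression builder can be checked without a Spark installation:

```python
def build_filter_expr(column, threshold):
    """Build a Spark SQL `filter` expression that keeps elements >= threshold."""
    return "filter({}, x -> x >= {})".format(column, int(threshold))


def with_filtered_expr(df, threshold=3):
    """Add `newcol` with the elements of `col3` that are >= threshold (Spark >= 2.4)."""
    # Imported here so build_filter_expr can be used without pyspark installed.
    from pyspark.sql.functions import expr

    return df.withColumn("newcol", expr(build_filter_expr("col3", threshold)))
```

For example, `with_filtered_expr(df).select("col1", "col2", "newcol").show()` should print a table shaped like the one the question asks for. The filtering runs inside the JVM, so no Python udf is involved.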

Reference

Querying Spark SQL DataFrame with complex types

* Given the cost of exploding the array or converting to and from an RDD, a udf is almost always the best choice.
