Question

I have the following Spark/Scala problem to solve.

I have this DF:

+--------------+--------------------+
|co_tipo_arquiv|          errorCodes|
+--------------+--------------------+
|            05|[10531, 20524, 10...|

with this schema:

root
 |-- co_tipo_arquiv: string (nullable = true)
 |-- errorCodes: array (nullable = true)
 |    |-- element: string (containsNull = true)

I need to check whether any of the codes in my error list (list_erors) appear in the errorCodes column of the df.

val list_erors = List("10531","10144")

I tried this, but it doesn't work:

dfNire.filter(col("errorCodes").isin(list_erors)).show()

Answer 1

Spark 2.4+

You can use the array_intersect function on the array of errors.

val list_errors = Array("10531","10144")

df.withColumn("intersect", array_intersect(col("errors"), lit(list_errors))).show(false)

The result then looks like this:

+---+---------------------+---------+
|id |errors               |intersect|
+---+---------------------+---------+
|05 |[10531, 20524, 11111]|[10531]  |
+---+---------------------+---------+

The column names in my test are placeholders, so they differ from the ones in the question.
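As a rough sketch of how the call above might be adapted to the column names from the question (errorCodes, co_tipo_arquiv), the following keeps list_erors as a Scala List and wraps it with typedLit instead of lit. This is a swap made for illustration, not part of the answer. The object name IntersectColumnSketch, the helper withIntersect, and the local SparkSession are hypothetical, and the sample row reuses the values from the test output above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_intersect, col, typedLit}

// Hypothetical names, for illustration only.
object IntersectColumnSketch {
  // Adds an "intersect" column with the codes present in both errorCodes and `errors`.
  // typedLit (Spark 2.2+) is used here because `errors` is a Scala List.
  def withIntersect(df: DataFrame, errors: List[String]): DataFrame =
    df.withColumn("intersect", array_intersect(col("errorCodes"), typedLit(errors)))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to run the sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("intersect-column-sketch")
      .getOrCreate()
    import spark.implicits._

    // Sample row shaped like the question's schema, values taken from the test output above.
    val dfNire = Seq(("05", Seq("10531", "20524", "11111")))
      .toDF("co_tipo_arquiv", "errorCodes")
    val list_erors = List("10531", "10144")

    withIntersect(dfNire, list_erors).show(false)
    spark.stop()
  }
}
```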

Answer 2

If you want to check for, or list, the rows whose array contains any of list_errors, then:

import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

df.show()
//+------------------+
//|        errorCodes|
//+------------------+
//|[10531, 20254, 10]|
//|              [10]|
//+------------------+

// Returns a UDF that is true when the array shares at least one element with s.
def is_exists_any(s: Seq[String]): UserDefinedFunction = udf((c: collection.mutable.WrappedArray[String]) => c.toList.intersect(s).nonEmpty)

val list_errors = Seq("10531", "10144")

df.withColumn("is_exists", is_exists_any(list_errors)(col("errorCodes"))).filter(col("is_exists") === true).show()
//+------------------+---------+
//|        errorCodes|is_exists|
//+------------------+---------+
//|[10531, 20254, 10]|     true|
//+------------------+---------+

Another way to get the rows without a udf is to use array_intersect and then list only the rows where the size of the resulting array is not zero.
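A minimal sketch of that udf-free variant, assuming Spark 2.4+ (where array_intersect is available) and the same sample rows as in the df.show() output above. The object name IntersectFilterSketch, the helper rowsWithAnyError, and the local SparkSession are hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array_intersect, col, lit, size}

// Hypothetical names, for illustration only.
object IntersectFilterSketch {
  // Keeps the rows whose errorCodes array shares at least one element with `errors`:
  // intersect the two arrays, then filter on a non-zero size.
  def rowsWithAnyError(df: DataFrame, errors: Array[String]): DataFrame =
    df.filter(size(array_intersect(col("errorCodes"), lit(errors))) > 0)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just to run the sketch.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("intersect-filter-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same two sample rows as in the df.show() output above.
    val df = Seq(Seq("10531", "20254", "10"), Seq("10")).toDF("errorCodes")
    val list_errors = Array("10531", "10144")

    rowsWithAnyError(df, list_errors).show(false)
    spark.stop()
  }
}
```

With the sample data, only the first row contains a code from list_errors, so it should be the only row left after the filter.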

Compare values in an array column of a Spark DataFrame
