Question

I have a DataFrame with the following schema:

hello.printSchema()
root
 |-- list_a: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- list_b: array (nullable = true)
 |    |-- element: integer (containsNull = true)

and the following sample data:

hello.take(2)
[Row(list_a=[7, 11, 1, 14, 13, 15,999], list_b=[15, 13, 7, 11, 1, 14]),
 Row(list_a=[7, 11, 1, 14, 13, 15], list_b=[11, 1, 7, 14, 15, 13, 12])]

Desired output

Sort list_a and list_b.
Create a new column list_diff such that list_diff = list(set(list_a) - set(list_b)), or an empty ArrayType if there is no such difference.

The approach I have tried is to use UDFs.

Specifically, I am using the following UDFs:

sort_udf=udf(lambda x: sorted(x), ArrayType(IntegerType()))
differencer=udf(lambda x,y: [elt for elt in x if elt not in y], ArrayType(IntegerType()))

Python list-like operations do not seem to be supported:

hello = hello.withColumn('sorted', sort_udf(hello.list_a))
hello = hello.withColumn('difference', differencer(hello.list_a, hello.list_b))

The operations above result in the following error:

Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
[Redacted Stack Trace]
TypeError: 'NoneType' object is not iterable

Am I missing something here?

Answer 1

The error message:

TypeError: 'NoneType' object is not iterable

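is a Python exception (as opposed to a Spark error), which means your code is failing inside the udf. The problem is that your DataFrame contains some null values, so when the udf is called, None may be passed to sorted: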

>>> sorted(None)
TypeError                                 Traceback (most recent call last)
<ipython-input-72-edb1060f46c4> in <module>()
----> 1 sorted(None)

TypeError: 'NoneType' object is not iterable

The fix is to make your udf more robust to bad input. In your case, you can change the functions to handle null input as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

# return None if input is None
sort_udf = udf(lambda x: sorted(x) if x is not None else None, ArrayType(IntegerType()))

# return None if either x or y are None
differencer = udf(
    lambda x,y: [e for e in x if e not in y] if x is not None and y is not None else None,
    ArrayType(IntegerType())
)
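
The null handling above can be checked in plain Python, without a Spark session. The sketch below rewrites the two lambdas as named functions; the names sort_or_none and difference_or_none are hypothetical and chosen only for illustration. Either function could then be passed to udf(...) in place of the corresponding lambda.

```python
def sort_or_none(values):
    """Return a sorted copy of `values`, or None when the input is None."""
    return sorted(values) if values is not None else None


def difference_or_none(xs, ys):
    """Return the elements of `xs` not found in `ys`, or None if either is None."""
    if xs is None or ys is None:
        return None
    return [e for e in xs if e not in ys]


# A null array no longer raises TypeError; it simply yields None.
print(sort_or_none(None))
print(sort_or_none([7, 11, 1, 14, 13, 15, 999]))

# Sample rows from the question, plus a null input.
print(difference_or_none([7, 11, 1, 14, 13, 15, 999], [15, 13, 7, 11, 1, 14]))
print(difference_or_none([7, 11, 1, 14, 13, 15], [11, 1, 7, 14, 15, 13, 12]))
print(difference_or_none(None, [1, 2]))
```

Note that the list comprehension keeps duplicates and the original order of x, which differs slightly from list(set(list_a) - set(list_b)) in the question.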

However, sort_udf is not needed, because you can use pyspark.sql.functions.sort_array() instead.
