After withColumn with a UDF, running count() gives TypeError: 'NoneType' object is not subscriptable

Time: 2021-04-05 21:52:25

Tags: python apache-spark pyspark apache-spark-sql

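I use withColumn with a UDF to create a new column, then select two columns and assign the result to a new DataFrame. When I run count() on this new DataFrame, I get TypeError: 'NoneType' object is not subscriptable, although show() works fine. I am trying to get the number of rows in the new DataFrame. Here is my code: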

# Find all entities with names that are palindromes 
# (name reads the same way forward and reverse, e.g. madam):
# print the count and show() the resulting Spark DataFrame
from pyspark.sql.types import BooleanType

def is_palindrome(entity_name):
    return entity_name == entity_name[::-1]
spark_udf = udf(is_palindrome, BooleanType())
palindrome_df = cb_sdf.withColumn('is_palindrome', spark_udf('name'))
palindrome_df = palindrome_df.where(palindrome_df['is_palindrome']).select('name', 'is_palindrome')
print(palindrome_df.show())
print(palindrome_df.count())

This is the output and the error message I get:

+------+-------------+
|  name|is_palindrome|
+------+-------------+
| KAYAK|         true|
| ooVoo|         true|
| 63336|         true|
| TipiT|         true|
| beweb|         true|
|   CSC|         true|
|   CBC|         true|
|   OQO|         true|
|   SAS|         true|
|   e4e|         true|
|   PHP|         true|
|   ivi|         true|
|  ADDA|         true|
|izeezi|         true|
| siXis|         true|
| STATS|         true|
|   8x8|         true|
|   IXI|         true|
|   GLG|         true|
|   2e2|         true|
+------+-------------+
only showing top 20 rows

None
---------------------------------------------------------------------------
PythonException                           Traceback (most recent call last)
<ipython-input-24-7fd424328e85> in <module>()
     10 palindrome_df = palindrome_df.where(palindrome_df['is_palindrome']).select('name', 'is_palindrome')
     11 print(palindrome_df.show())
---> 12 print(palindrome_df.count())

2 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
    115                 # Hide where the exception came from that shows a non-Pythonic
    116                 # JVM exception message.
--> 117                 raise converted from None
    118             else:
    119                 raise

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
    for obj in iterator:
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 200, in _batched
    for item in iterator:
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
    return lambda *a: f(*a)
  File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-24-7fd424328e85>", line 7, in is_palindrome
TypeError: 'NoneType' object is not subscriptable

Thanks in advance!

1 answer:

Answer 0 (score: 0)

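Your DataFrame probably has null values somewhere in the name column, just not in the first 20 rows that show() displays. show() only needs to evaluate enough rows to fill its output, while count() has to run the UDF over every row. That is why you get the error when counting the whole DataFrame but not when showing 20 rows of it.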
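The difference between a partial and a full evaluation can be illustrated with a plain-Python analogy. This is only a sketch, not Spark itself: the `names` list is hypothetical data, and a generator stands in for Spark's lazy evaluation.

```python
from itertools import islice


def is_palindrome(entity_name):
    # Same logic as the UDF in the question: fails on None
    return entity_name == entity_name[::-1]


# Hypothetical data: the missing value sits after the rows we peek at
names = ["KAYAK", "ooVoo", "CSC", None, "PHP"]

# Like show(): only the first few rows are ever evaluated
lazy_results = (is_palindrome(n) for n in names)
first_three = list(islice(lazy_results, 3))
print(first_three)

# Like count(): every row is evaluated, so the None is eventually hit
full_scan_failed = False
try:
    total = sum(1 for n in names if is_palindrome(n))
except TypeError:
    full_scan_failed = True
print(full_scan_failed)
```

The first print succeeds because the None is never reached; the full scan raises the same TypeError as in the traceback.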

To keep null values from crashing the job, change your UDF to:

def is_palindrome(entity_name):
    if entity_name is None:
        return None
    else:
        return entity_name == entity_name[::-1]
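
As a quick check outside Spark, the guarded function can be compared with the original one on a few values. This is a minimal sketch: `sample_names` and the name `is_palindrome_unsafe` are made up for illustration and do not come from the question's data.

```python
def is_palindrome_unsafe(entity_name):
    # Original version from the question: slicing None raises TypeError
    return entity_name == entity_name[::-1]


def is_palindrome(entity_name):
    # Guarded version: pass missing values through as None
    if entity_name is None:
        return None
    return entity_name == entity_name[::-1]


# Hypothetical sample values, including one missing name
sample_names = ["KAYAK", "Spark", None]

results = [is_palindrome(n) for n in sample_names]
print(results)

# The unguarded version reproduces the error from the traceback
error_message = None
try:
    is_palindrome_unsafe(None)
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

When the guarded function is wrapped with udf(is_palindrome, BooleanType()), a returned None becomes a null in the is_palindrome column, and where() drops rows whose condition is null, so those rows are simply left out of the count.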