I use withColumn with a UDF to create a new column, then select two columns and assign the result to a new df. But when I run count() on this new df, it gives me TypeError: 'NoneType' object is not subscriptable. show() works fine. I'm trying to get the length of the new df. Here is my code:
# Find all entities with names that are palindromes
# (name reads the same way forward and reverse, e.g. madam):
# print the count and show() the resulting Spark DataFrame
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
def is_palindrome(entity_name):
    return entity_name == entity_name[::-1]
spark_udf = udf(is_palindrome, BooleanType())
palindrome_df = cb_sdf.withColumn('is_palindrome', spark_udf('name'))
palindrome_df = palindrome_df.where(palindrome_df['is_palindrome']).select('name', 'is_palindrome')
print(palindrome_df.show())
print(palindrome_df.count())
Here is the output and the error message I get:
+------+-------------+
|  name|is_palindrome|
+------+-------------+
| KAYAK|         true|
| ooVoo|         true|
| 63336|         true|
| TipiT|         true|
| beweb|         true|
|   CSC|         true|
|   CBC|         true|
|   OQO|         true|
|   SAS|         true|
|   e4e|         true|
|   PHP|         true|
|   ivi|         true|
|  ADDA|         true|
|izeezi|         true|
| siXis|         true|
| STATS|         true|
|   8x8|         true|
|   IXI|         true|
|   GLG|         true|
|   2e2|         true|
+------+-------------+
only showing top 20 rows
None
---------------------------------------------------------------------------
PythonException Traceback (most recent call last)
<ipython-input-24-7fd424328e85> in <module>()
10 palindrome_df = palindrome_df.where(palindrome_df['is_palindrome']).select('name', 'is_palindrome')
11 print(palindrome_df.show())
---> 12 print(palindrome_df.count())
2 frames
/usr/local/lib/python3.7/dist-packages/pyspark/sql/utils.py in deco(*a, **kw)
115 # Hide where the exception came from that shows a non-Pythonic
116 # JVM exception message.
--> 117 raise converted from None
118 else:
119 raise
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
serializer.dump_stream(out_iter, outfile)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
for obj in iterator:
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/serializers.py", line 200, in _batched
for item in iterator:
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
return lambda *a: f(*a)
File "/usr/local/lib/python3.7/dist-packages/pyspark/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
return f(*args, **kwargs)
File "<ipython-input-24-7fd424328e85>", line 7, in is_palindrome
TypeError: 'NoneType' object is not subscriptable
Thanks in advance!
Answer 0 (score: 0)
There is probably a null value in the name column somewhere in your DataFrame, just not in the first 20 rows you displayed. That is why the error appears when counting the whole DataFrame but not when showing 20 rows of it: count() forces the UDF to run over every row, and entity_name[::-1] raises a TypeError when entity_name is None.
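A quick way to confirm this (a minimal sketch, assuming the source DataFrame is cb_sdf with a string column name, as in the question) is to count the null names directly:

from pyspark.sql.functions import col
# Count rows where the name column is null; a non-zero result confirms the diagnosis above
print(cb_sdf.filter(col('name').isNull()).count())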
To keep null values from crashing the job, change your UDF to:
def is_palindrome(entity_name):
    if entity_name is None:
        return None
    else:
        return entity_name == entity_name[::-1]
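You could also skip the UDF entirely. A sketch under the same assumption (a cb_sdf DataFrame with a string column name): the built-in reverse function from pyspark.sql.functions works on strings, and comparing a null value yields null, which where() treats as false, so null names are dropped instead of crashing the worker:

from pyspark.sql.functions import col, reverse
# Build the same name / is_palindrome result using only built-in column expressions
palindrome_df = (
    cb_sdf
    .withColumn('is_palindrome', col('name') == reverse(col('name')))
    .where(col('is_palindrome'))
    .select('name', 'is_palindrome')
)
palindrome_df.show()
print(palindrome_df.count())

(Incidentally, the stray None in your output comes from print(palindrome_df.show()): show() already prints the table and returns None, so a plain palindrome_df.show() is enough.)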