I have a Spark DataFrame in PySpark that I'm trying to drop nulls from.

Earlier, while cleaning things up during parsing, I ran a `convert_to_null` method on the `title` column. It basically checks whether a string literally says `"None"` and, if so, converts it to an actual Python `None`, which Spark then stores as its internal null type.
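For context, `convert_to_null` is a UDF built from a plain Python function named `replace_none_with_null` (that name shows up in the traceback below). Roughly, the setup looks like this (a simplified sketch of the idea; the exact check in my real code may differ slightly):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

def replace_none_with_null(value):
    # Turn the literal string "None" into a real null.
    # NB: the `in` check assumes `value` is a string; a genuine None on the
    # left-hand side is exactly what raises the TypeError shown below.
    if value in 'None':
        return None
    return value

convert_to_null = F.udf(replace_none_with_null, StringType())
df = df.withColumn('title', convert_to_null(df['title']))
```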
Now I'm trying to drop the rows that have that null in the `title` column. Here's everything I've tried to drop the nulls:
```python
new_df = df.na.drop(subset=['title'])
new_df = df[F.col('title').isNotNull()]
new_df = df[~F.col('title').isNull()]
```
But each time, I hit this error on the `new_df.show()` call that follows:
```
Py4JJavaError: An error occurred while calling o2022.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 87.0 failed 1 times, most recent failure: Lost task 1.0 in stage 87.0 (TID 314, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 324, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in dump_stream
    for obj in iterator:
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 313, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 75, in <lambda>
    return lambda *a: f(*a)
  File "/usr/local/spark/python/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-16-48bc3ec1b5d9>", line 5, in replace_none_with_null
TypeError: 'in <string>' requires string as left operand, not NoneType
```
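What's strange is that these same filters behave as expected on a toy DataFrame whose null is a genuine null from the start (a minimal sketch, assuming a local SparkSession; `toy` is just an illustrative name):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master('local[*]').getOrCreate()

# One real string and one genuine null in `title`.
toy = spark.createDataFrame([('spark',), (None,)], ['title'])

# Each of these keeps only the non-null row, with no error on show().
toy.na.drop(subset=['title']).show()
toy[F.col('title').isNotNull()].show()
toy[~F.col('title').isNull()].show()
```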
I think I'm losing my mind here. I can't figure out how to fix this. Any help is appreciated. Thanks!