Inexplicable PySpark behavior when dropping nulls

Asked: 2018-10-07 04:02:35

Tags: python apache-spark pyspark apache-spark-sql

I have a Spark DataFrame in PySpark from which I'm trying to drop nulls.

Earlier, while cleaning during the parsing stage, I ran a convert_to_null method on the title column. It basically checks whether a string literally says "None" and, if so, converts it to an actual Python None, so that Spark turns it into its internal null type.
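The question doesn't show that method, but judging from the description and the replace_none_with_null frame in the traceback below, it was presumably a UDF roughly like the following sketch (the function body, the exact string check, and the column wiring are assumptions; the traceback suggests the real check uses the in operator):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Hypothetical reconstruction of the cleaning step; the real
# replace_none_with_null is not shown in the question.
def replace_none_with_null(value):
    if value == 'None':  # the string literally says "None" ...
        return None      # ... so return an actual Python None
    return value

convert_to_null = F.udf(replace_none_with_null, StringType())
df = df.withColumn('title', convert_to_null(F.col('title')))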

Now I'm trying to drop the rows whose title column holds that null type. Here is everything I've tried to drop the nulls:

from pyspark.sql import functions as F

# Attempt 1: dropna restricted to the title column
# (na.drop('title') would be read as how='title'; the column belongs in subset)
new_df = df.na.drop(subset=['title'])

# Attempt 2: keep only the rows where title is not null
new_df = df[F.col('title').isNotNull()]

# Attempt 3: the same filter, negated
new_df = df[~F.col('title').isNull()]
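For context, a minimal end-to-end sketch of the setup would look roughly like this (the data is hypothetical, and convert_to_null refers to the sketch above):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical two-row stand-in for the parsed DataFrame.
df = spark.createDataFrame([('None',), ('a real title',)], ['title'])
df = df.withColumn('title', convert_to_null(F.col('title')))

new_df = df[F.col('title').isNotNull()]
new_df.show()  # transformations are lazy, so any failure surfaces here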

But I always hit this error on the new_df.show() call that follows those lines:

Py4JJavaError: An error occurred while calling o2022.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 87.0 failed 1 times, most recent failure: Lost task 1.0 in stage 87.0 (TID 314, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 230, in main
    process()
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 225, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 324, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 139, in dump_stream
    for obj in iterator:
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 313, in _batched
    for item in iterator:
  File "<string>", line 1, in <lambda>
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/worker.py", line 75, in <lambda>
    return lambda *a: f(*a)
  File "/usr/local/spark/python/pyspark/util.py", line 55, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-16-48bc3ec1b5d9>", line 5, in replace_none_with_null
TypeError: 'in <string>' requires string as left operand, not NoneType

I think I'm going crazy. I have no idea how to fix this. Any help is appreciated. Thanks!

0 Answers:

No answers yet.