Does anyone know why this happens:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1089.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1089.0 (TID 1951, ip-10-0-208-38.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 101, in main
process()
File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 96, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 236, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "<ipython-input-762-e1c8f006c3c2>", line 4, in getPdfData
File "<ipython-input-762-e1c8f006c3c2>", line 85, in extractData
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2566, in extractText
content = ContentStream(content, self.pdf)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2644, in __init__
stream = BytesIO(b_(stream.getData()))
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/generic.py", line 837, in getData
decoded._data = filters.decodeStreamData(self)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 346, in decodeStreamData
data = FlateDecode.decode(data, stream.get("/DecodeParms"))
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 111, in decode
data = decompress(data)
File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 49, in decompress
return zlib.decompress(data)
error: Error -5 while decompressing data: incomplete or truncated stream
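I am using PyPDF2 to process PDF files on the workers: I create a PyPDF2 PDF object, call getDocumentInfo(), and call extractText() on the page objects. I am not using the zlib module explicitly, which, according to the oracle known as the internet, is where this 'decompression' error usually comes from. My code runs perfectly well for a smaller number of PDFs (around 500 or so) stored in an RDD (3 workers), but when I scale it up to 5000 or more, it fails with the error above. Any ideas?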
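
For what it's worth, the same message can be reproduced without Spark or PyPDF2 by handing zlib a truncated stream. That makes me suspect, though I have not confirmed it, that at least one PDF in the larger set contains a cut-off Flate-compressed stream. A minimal sketch follows; `try_decompress` and the sample payload are made up for illustration and are not part of my actual job:

```python
import zlib


def try_decompress(data):
    """Return (True, bytes) on success, or (False, message) if zlib rejects the stream.

    Hypothetical helper for illustration only; it is not part of PyPDF2 or Spark.
    """
    try:
        return True, zlib.decompress(data)
    except zlib.error as exc:
        return False, str(exc)


# Assumed sample data: a small, valid zlib stream.
payload = zlib.compress(b"example page content " * 50)

# The complete stream decompresses normally.
ok_full, _ = try_decompress(payload)

# Cutting the stream in half mimics a truncated stream inside a PDF.
ok_cut, message = try_decompress(payload[: len(payload) // 2])

print(ok_full, ok_cut, message)
```

If that is indeed the cause, catching `zlib.error` around the per-file extraction and returning the file name instead of raising would at least show which PDFs are affected, rather than aborting the whole stage.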