Question

Does anyone know why this happens:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1089.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1089.0 (TID 1951, ip-10-0-208-38.ec2.internal): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 101, in main
    process()
  File "/home/ubuntu/databricks/spark/python/pyspark/worker.py", line 96, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/home/ubuntu/databricks/spark/python/pyspark/serializers.py", line 236, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "<ipython-input-762-e1c8f006c3c2>", line 4, in getPdfData
  File "<ipython-input-762-e1c8f006c3c2>", line 85, in extractData
  File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2566, in extractText
    content = ContentStream(content, self.pdf)
  File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/pdf.py", line 2644, in __init__
    stream = BytesIO(b_(stream.getData()))
  File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/generic.py", line 837, in getData
    decoded._data = filters.decodeStreamData(self)
  File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 346, in decodeStreamData
    data = FlateDecode.decode(data, stream.get("/DecodeParms"))
  File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 111, in decode
    data = decompress(data)
  File "./addedFile3142227314340912289bfddea6e_2a71_413e_b077_a4493987f9d9_PyPDF2-12df7.egg/PyPDF2/filters.py", line 49, in decompress
    return zlib.decompress(data)
error: Error -5 while decompressing data: incomplete or truncated stream

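I'm using PyPDF2 to process PDF files on the workers: I create the PyPDF pdfObject, call getDocumentInfo(), and call extract_text() on the pageObjects. I'm not using the zlib module explicitly, which is where this 'compression' error normally occurs according to the oracle known as the internet. My code runs perfectly well for a smaller number of PDFs (around 500 or so) stored in an RDD (3 workers), but when I scale it up to 5000 or more, it errors out. Any ideas?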
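Since the failure only shows up on the larger set, one plausible explanation (an assumption, not something the traceback proves) is that a few of the extra PDFs contain a damaged compressed stream. A minimal sketch for checking that: wrap the per-document work so that a zlib.error is recorded instead of failing the task. The helper name safe_extract and the stand-in data are hypothetical; only the zlib calls are real, and zlib.decompress plays the role of the PDF extraction step here.

```python
import zlib


def safe_extract(name, raw_bytes, extract_fn):
    """Run extract_fn on one document and return (name, text, error) instead of raising."""
    try:
        return (name, extract_fn(raw_bytes), None)
    except zlib.error as exc:
        # Same exception type as the error at the bottom of the traceback.
        return (name, None, str(exc))


# Stand-in for a damaged PDF stream: valid zlib data with its tail cut off.
payload = b"stand-in for page content " * 50
truncated = zlib.compress(payload)[:-10]

good = safe_extract("ok.pdf", zlib.compress(payload), zlib.decompress)
bad = safe_extract("broken.pdf", truncated, zlib.decompress)

print(good[2])  # None
print(bad[2])   # the zlib error message for the truncated stream
```

In the Spark job, the same wrapper could presumably be applied inside the function passed to rdd.map, with the existing extractData in place of zlib.decompress (its real signature is not shown in the question, so that part is a guess). Filtering the results for a non-empty error field would then list which PDFs are affected.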

'Error -5 while decompressing data' in Spark, in the PyPDF2 lib

0 answers: