Read error when reading a large file in PySpark

Time: 2019-11-28 03:48:14

Tags: python unix amazon-s3 pyspark pyspark-dataframes

I am trying to read a text file (the input) and get the record count as shown below. The statement keeps failing with the following error:

“java.io.IOException: Too many bytes before newline”

Here is the log:

19/11/27 13:33:11 ERROR TaskSetManager: Task 65 in stage 0.0 failed 4 times; aborting job

Traceback (most recent call last):

  File "/mnt/tmp/spark-e0780800-8429-4290-a011-5b6d3239ade4/raw_tier_process.py", line 144, in <module>

    input_count = sc.textFile(source_name).filter(lambda x: len(x) > 0).count()

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco

  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value

py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
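If the records are not actually terminated by \n, this exception would make sense: it is typically raised by Hadoop's line reader when it consumes an extremely long run of bytes without ever finding the record delimiter. Below is a minimal sketch of counting the records with an explicitly configured record delimiter; the property name textinputformat.record.delimiter is the standard Hadoop one, and the delimiter value to try is only an assumption.

# Sketch: read the file through Hadoop's TextInputFormat with an explicit
# record delimiter instead of relying on "\n". The delimiter value below is
# illustrative -- "\r" or some other terminator are further candidates.
path = "s3://test-landing-bucket/iros/iRos_File-Customer_Product_20191024_121001.dat"

rdd = sc.newAPIHadoopFile(
    path,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "\r\n"},
)

# The value of each (offset, text) pair holds the record text; drop empty records and count.
input_count = rdd.map(lambda kv: kv[1]).filter(lambda x: len(x) > 0).count()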

To investigate the issue, I also tried reading the file with the following command, but it fails as well:

target_obj = spark.read.csv(path="s3://test-landing-bucket/iros/iRos_File-Customer_Product_20191024_121001.dat",sep="^A",quote = '',header="true", mode="PERMISSIVE")

The file I am trying to read is a 10 GB text file stored on S3, with ^A as the field delimiter and \n as the record delimiter. I tried copying the file from S3 to my local UNIX box and opening it in vi to look at the problem, but it is too large to open.
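Since ^A usually denotes the Ctrl-A (SOH, 0x01) control character, the literal two-character string "^A" passed as sep above would not match it. Here is a sketch of the same read with the actual control byte, assuming the file really is SOH-delimited; the path and options mirror the attempt above.

# Sketch: pass the real SOH byte (\x01) as the separator instead of the literal
# characters "^" and "A". All other options mirror the original attempt.
target_obj = spark.read.csv(
    path="s3://test-landing-bucket/iros/iRos_File-Customer_Product_20191024_121001.dat",
    sep="\x01",        # Ctrl-A / SOH field delimiter
    quote="",          # disable quoting, as in the original attempt
    header=True,
    mode="PERMISSIVE",
)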

Is there another way to open and inspect chunks of such a large file on UNIX?
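One option that avoids loading all 10 GB would be to read just the first few kilobytes in Python and print them with repr(), which makes control characters such as \x01 and the line endings visible. A sketch, assuming a hypothetical local copy of the file under /tmp:

# Sketch: peek at the first 4 KB of a local copy of the file. repr() shows
# control characters (\x01, \r, \n) explicitly, and counting b"\n" gives a
# quick sanity check on whether newlines occur at all. The local path is illustrative.
with open("/tmp/iRos_File-Customer_Product_20191024_121001.dat", "rb") as f:
    chunk = f.read(4096)

print(repr(chunk))
print("newlines in first 4 KB:", chunk.count(b"\n"))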

Can someone help me fix this error and work around the issue?

Thanks

0 Answers:

No answers yet