I am trying to read a text file (the input) and get its record count, as shown below. The statement keeps failing with the following error:
“java.io.IOException: Too many bytes before newline”
Here is the log:
19/11/27 13:33:11 ERROR TaskSetManager: Task 65 in stage 0.0 failed 4 times; aborting job
Traceback (most recent call last):
File "/mnt/tmp/spark-e0780800-8429-4290-a011-5b6d3239ade4/raw_tier_process.py", line 144, in <module>
input_count = sc.textFile(source_name).filter(lambda x: len(x) > 0).count()
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1055, in count
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1046, in sum
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 917, in fold
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 816, in collect
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
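For reference, here is the failing statement in isolation (a minimal sketch; the S3 path is the same object I try to read below, and the app name is just a placeholder):

from pyspark import SparkContext

sc = SparkContext(appName="raw_tier_count")  # placeholder app name
source_name = "s3://test-landing-bucket/iros/iRos_File-Customer_Product_20191024_121001.dat"

# This count() is the call that dies with "java.io.IOException: Too many bytes before newline"
input_count = sc.textFile(source_name).filter(lambda x: len(x) > 0).count()
print(input_count)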
To investigate the problem, I also tried reading the file with the following command, but it fails as well:
target_obj = spark.read.csv(path="s3://test-landing-bucket/iros/iRos_File-Customer_Product_20191024_121001.dat", sep="^A", quote='', header="true", mode="PERMISSIVE")
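I am not sure whether Spark sees sep="^A" as the single SOH control character (0x01) or as the two literal characters ^ and A, so for completeness here is the same read with the separator spelled out as an explicit control character (a sketch of what I believe is equivalent to my intent):

target_obj = spark.read.csv(
    path="s3://test-landing-bucket/iros/iRos_File-Customer_Product_20191024_121001.dat",
    sep="\u0001",  # SOH / Ctrl-A written as an explicit control character
    quote="",      # an empty string turns quoting off
    header="true",
    mode="PERMISSIVE",
)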
The file I am trying to read is a 10 GB text file stored on S3, with ^A as the field delimiter and \n as the record delimiter. I tried to cp the file from S3 to my local UNIX machine and open it in the vi editor to look for the bad record, but it is too large to open.
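The closest I have gotten is peeking at the first few kilobytes of the local copy from Python, which avoids loading the whole file into memory (a sketch):

# Read only the first 4 KB and look for record terminators, to check whether
# the file really uses \n between records (and not, say, \r or nothing at all).
with open("iRos_File-Customer_Product_20191024_121001.dat", "rb") as f:
    chunk = f.read(4096)

print(chunk[:200])  # the bytes repr shows control characters such as \x01, \r, \n
print("newlines:", chunk.count(b"\n"), "carriage returns:", chunk.count(b"\r"))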
Is there another way to inspect the contents of such a large file on UNIX?
Can someone help me fix this error and work around the issue?
Thanks.