I am reading a large txt file that contains 15 columns and 2,300,000,000 rows, and I performed the following operations on the PySpark DataFrame:
I ran the command to write the DataFrame, and after 17 hours I received the following error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 19 in stage 17.0 failed 4 times, most recent failure: Lost task 19.3 in stage 17.0 (TID 87, 10.175.252.55, executor 6): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: worker lost
This is how I read the DataFrame:
df = sqlContext.read.csv("file.txt", header=True)
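For context, a self-contained version of this read step might look like the sketch below. The SparkSession setup, the app name, and the `sep` delimiter are assumptions on my part; only the `read.csv("file.txt", header=True)` call comes from the post (which uses the older `sqlContext` entry point).

```python
# Minimal sketch of the read step. The session setup, app name, and delimiter
# are assumptions; only the read.csv call itself comes from the post.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("large-csv-to-parquet").getOrCreate()

df = spark.read.csv(
    "file.txt",
    header=True,   # first line carries the 15 column names
    sep=",",       # assumed separator; adjust to the file's actual delimiter
)
```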
This is how I write the Parquet file:
df.write.option("compression", "gzip").parquet("file.parquet")
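Continuing the sketch above, the write step would look roughly like this; the `overwrite` save mode and the output path layout are assumptions the post does not state.

```python
# Write the DataFrame out as gzip-compressed Parquet
# (Spark's default Parquet codec is snappy; the post overrides it with gzip).
# The save mode is an assumption; the post only shows the option/parquet calls.
(
    df.write
      .mode("overwrite")
      .option("compression", "gzip")
      .parquet("file.parquet")
)
```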
I tried the same code with a file that has 270 columns and 400,000,000 rows, and it worked fine.