Losing executors when saving a Parquet file

Time: 2018-03-08 09:59:41

Tags: pyspark parquet

I have loaded a dataset of roughly 20 GB; the cluster has about 1 TB available, so memory should not be an issue, in my opinion.

Saving the raw data, which consists only of strings, works without any problems:

df_data.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')

However, when I transform the data:

import json  # needed for json.loads below

df_transformed = df_data.drop('bri').join(
    # parse the JSON string column on the RDD, mapping NULL to an empty dict
    df_data[['docId', 'bri']].rdd
        .map(lambda x: (x.docId, json.loads(x.bri))
             if x.bri is not None else (x.docId, dict()))
        .toDF()
        .withColumnRenamed('_1', 'docId')
        .withColumnRenamed('_2', 'bri'),
    ['docId']
)

and then save the result:

df_transformed.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')

the log output tells me that the memory limit was exceeded:

18/03/08 10:23:09 WARN TaskSetManager: Lost task 17.0 in stage 18.3 (TID 2866, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 29.0 in stage 18.3 (TID 2878, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 65.0 in stage 18.3 (TID 2914, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
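
Since the message itself points at spark.yarn.executor.memoryOverhead (the off-heap headroom YARN adds on top of the executor heap), here is a minimal sketch of raising it when the session is built; the sizes are illustrative only, and depending on deployment mode the setting may need to be passed to spark-submit instead:

from pyspark.sql import SparkSession

# Illustrative sizes only -- not the cluster's actual configuration.
# spark.yarn.executor.memoryOverhead (in MB) is off-heap headroom on top of
# spark.executor.memory; the containers above were killed for exceeding the
# combined limit, not the heap alone.
spark = (SparkSession.builder
         .appName('concatenated-parquet')
         .config('spark.executor.memory', '12g')
         .config('spark.yarn.executor.memoryOverhead', '4096')
         .getOrCreate())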

I am not really sure what the problem is. Even setting the executor memory to 60 GB of RAM does not solve it.

So apparently the problem lies in the transformation. Any idea what exactly is causing this?
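
For reference, a minimal sketch of how the same reshaping could stay entirely in the DataFrame API, assuming Spark 2.2+ and that bri holds flat JSON objects with string keys and values; whether this sidesteps the memory-overhead problem is untested:

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

# Parse the JSON string column in place instead of round-tripping through an
# RDD and joining back. Rows where 'bri' is NULL simply stay NULL here (the
# RDD version mapped them to an empty dict).
df_transformed = df_data.withColumn(
    'bri', F.from_json('bri', MapType(StringType(), StringType()))
)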

0 Answers:

There are no answers yet.