I have loaded a dataset of roughly 20 GB - the cluster has ~1 TB available, so memory should not be an issue, imho.
Saving the raw data, which contains only strings, works without problems:
df_data.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
However, when I transform the data:
df_transformed = df_data.drop('bri').join(
    df_data[['docId', 'bri']].rdd
        .map(lambda x: (x.docId, json.loads(x.bri))
             if x.bri is not None else (x.docId, dict()))
        .toDF()
        .withColumnRenamed('_1', 'docId')
        .withColumnRenamed('_2', 'bri'),
    ['docId']
)
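For reference, the same reshaping expressed purely in the DataFrame API, without the round trip through the Python RDD API and the subsequent join, would look roughly like the sketch below. The MapType(StringType(), StringType()) return type is my assumption about the JSON held in bri; everything else reuses the names from above.

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType
import json

# Parse the JSON column in place instead of mapping over the RDD and joining back.
# MapType(StringType(), StringType()) assumes 'bri' holds a flat string-to-string object.
parse_bri = F.udf(lambda s: json.loads(s) if s is not None else {},
                  MapType(StringType(), StringType()))

df_transformed = df_data.withColumn('bri', parse_bri(F.col('bri')))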
and then save it:
df_transformed.write.parquet(os.path.join(DATA_SET_BASE, 'concatenated.parquet'), mode='overwrite')
the log output tells me that the memory limit was exceeded:
18/03/08 10:23:09 WARN TaskSetManager: Lost task 17.0 in stage 18.3 (TID 2866, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 29.0 in stage 18.3 (TID 2878, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
18/03/08 10:23:09 WARN TaskSetManager: Lost task 65.0 in stage 18.3 (TID 2914, worker06.hadoop.know-center.at): ExecutorLostFailure (executor 40 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 15.2 GB of 13.5 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
I am not really sure what the problem is. Even setting the executor memory to 60 GB of RAM does not solve the problem.
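For completeness, my reading of the hint in the log is that it refers to the off-heap overhead (spark.yarn.executor.memoryOverhead) rather than the executor heap; setting it would look roughly like this, with the app name and the 4096 MB value being placeholders rather than values I actually use:

from pyspark.sql import SparkSession

# spark.yarn.executor.memoryOverhead is the off-heap allowance YARN grants on top of
# the executor heap; it has to be set before the SparkContext is created.
# The 4096 MB value is only an illustration.
spark = (SparkSession.builder
         .appName('transform-bri')  # placeholder name
         .config('spark.yarn.executor.memoryOverhead', '4096')
         .getOrCreate())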
So, apparently, the problem lies in the transformation. Any idea what exactly is causing it?