I'm trying to load a large dataset of roughly 700M rows from Google Cloud SQL into PySpark, using Zeppelin on Dataproc. Here is an example of what I tried:
%pyspark
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

jdbcDriver = 'com.mysql.jdbc.Driver'
jdbcUrl = 'jdbc:mysql://%s:3307/%s?user=%s&password=%s' % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)

# Read the table over JDBC, splitting it into 10 partitions on the appid column
df = sqlContext.read.format("jdbc").options(
    url=jdbcUrl,
    driver=jdbcDriver,
    dbtable='Table',
    partitionColumn='appid',
    lowerBound='1',
    upperBound='324830',
    numPartitions='10',
    user=CLOUDSQL_USER,
    password=CLOUDSQL_PWD).load()
When I run it, it fails after 8-10 minutes. This is (as far as I can tell) the relevant part of the error:
Py4JJavaError: An error occurred while calling o218.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3
in stage 2.0 (TID 11, ykog-dataproc-v2-w-3.c.decent-essence-135923.internal):
ExecutorLostFailure (executor 25 exited caused by one of the running
tasks) Reason: Container marked as failed:
container_1481727498237_0006_01_000026 on host: ykog-dataproc-v2-w-3.c.decent-essence-135923.internal.
Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
I think the problem is that it tries to fit everything into memory and runs out. Is there a way to set pyspark.StorageLevel, or what is the correct way to load a large dataset from SQL into PySpark?
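For reference, this is the kind of thing I had in mind with StorageLevel (just a sketch on my part; it assumes the same df as above, and I don't know whether persist() is the right hook here):

%pyspark
from pyspark import StorageLevel

# Sketch: allow cached partitions to spill to disk instead of being memory-only.
# Assumes the same df defined above.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # force an action so the data is actually materialized and cached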