Loading a large SQL dataset into PySpark

Time: 2017-01-17 13:49:32

Tags: python apache-spark pyspark google-cloud-sql google-cloud-dataproc

I am trying to load a large dataset of roughly 700M rows from Google Cloud SQL into PySpark, using Zeppelin on Dataproc. Here is an example of what I am trying to do:

%pyspark
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)  # sc is provided by the Zeppelin %pyspark interpreter

jdbcDriver = 'com.mysql.jdbc.Driver'
jdbcUrl = 'jdbc:mysql://%s:3307/%s?user=%s&password=%s' % (CLOUDSQL_INSTANCE_IP, CLOUDSQL_DB_NAME, CLOUDSQL_USER, CLOUDSQL_PWD)

# Partitioned JDBC read: Spark splits the 'appid' range [lowerBound, upperBound]
# into numPartitions queries and reads them in parallel.
df = sqlContext.read.format("jdbc").options(url=jdbcUrl,
                                            driver=jdbcDriver,
                                            dbtable='Table',
                                            partitionColumn='appid',
                                            lowerBound='1',
                                            upperBound='324830',
                                            numPartitions='10',
                                            user=CLOUDSQL_USER,
                                            password=CLOUDSQL_PWD).load()

When I run this, it fails after 8-10 minutes. Here is (as far as I can tell) the relevant part of the error:

Py4JJavaError: An error occurred while calling o218.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3
in stage 2.0 (TID 11, ykog-dataproc-v2-w-3.c.decent-essence-135923.internal): 
ExecutorLostFailure (executor 25 exited caused by one of the running
tasks) Reason: Container marked as failed: 
container_1481727498237_0006_01_000026 on host: ykog-dataproc-v2-w-3.c.decent-essence-135923.internal. 
Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal

I think the problem is that it tries to fit everything into memory and runs out. Is there a way to set pyspark.StorageLevel, or what is the correct way to load a large dataset from SQL into PySpark?
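
For reference, this is roughly the kind of thing I had in mind with pyspark.StorageLevel (just a sketch, assuming df is the JDBC DataFrame created above; MEMORY_AND_DISK is only one possible level):

from pyspark import StorageLevel

# Assuming df is the JDBC-backed DataFrame from the snippet above.
# MEMORY_AND_DISK lets partitions that do not fit in memory spill to local disk
# instead of being recomputed or dropped.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()  # force materialization so the table is actually read once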

0 Answers:

No answers yet