Py4JJavaError: An error occurred while calling o57.showString. : org.apache.spark.SparkException:

Date: 2019-06-12 14:07:12

Tags: amazon-web-services pyspark

I am using pyspark to connect to an AWS instance (r5d.xlarge, 4 vCPU, 32 GiB) running a 25 GB database, and when I query certain tables I get this error:

Py4JJavaError: An error occurred while calling o57.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): java.lang.OutOfMemoryError: GC overhead limit exceeded

I tried to figure the error out myself, but unfortunately there is not much information about this problem.

Code



Here I get the printSchema, but:


from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local').\
     config('spark.jars.packages', 'mysql:mysql-connector-java:5.1.44').\
     appName('test').getOrCreate()

df = spark.read.format('jdbc').\
        option('url', 'jdbc:mysql://xx.xxx.xx.xxx:3306').\
        option('driver', 'com.mysql.jdbc.Driver').\
        option('user', 'xxxxxxxxxxx').\
        option('password', 'xxxxxxxxxxxxxxxxxxxx').\
        option('dbtable', 'dbname.tablename').\
        load()

df.printSchema()

Does anyone know how to solve this problem?

1 Answer:

Answer 0: (score: 0)

Here is a way to parallelize a serial JDBC read across multiple Spark workers. You can use this as a guide and customize it for your own source data; the basic premise is that you need some kind of unique key to split on.

Refer specifically to the parameters partitionColumn, lowerBound, upperBound, and numPartitions in this documentation:

https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Some example code:

# find min and max for column used to split on
from pyspark.sql.functions import min, max

minDF = df.select(min("id")).first()[0] # replace 'id' with your key col
maxDF = df.select(max("id")).first()[0] # replace 'id' with your key col
numSplits = 125 # you will need to tailor this value to your dataset ... you mentioned your source as 25GB so try 25000 MB / 200 MB = 125 partitions

print("df min: {}\df max: {}".format(minDF, maxDF))

# your code => add a few more parameters
df = (spark.read.format('jdbc')
        .option('url', 'jdbc:mysql://xx.xxx.xx.xxx:3306')
        .option('driver', 'com.mysql.jdbc.Driver')
        .option('user', 'xxxxxxxxxxx')
        .option('password', 'xxxxxxxxxxxxxxxxxxxx')
        .option('dbtable', 'dbname.tablename')
        .option('partitionColumn', 'id')     # col to split on
        .option('lowerBound', minDF)         # min value
        .option('upperBound', maxDF)         # max value
        .option('numPartitions', numSplits)  # num of splits (partitions) spark will distribute across executor workers
        .load())

print(df.rdd.getNumPartitions())
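
For intuition, with these options Spark issues roughly one query per partition, each restricted to a slice of the key range. The following is only a simplified sketch of those predicates, reusing the minDF, maxDF, and numSplits values computed above (the exact SQL Spark generates differs slightly, e.g. the first partition also picks up NULL keys):

# Simplified illustration of how Spark turns the bounds into per-partition
# WHERE clauses (not the exact SQL Spark generates, just the idea):
stride = (maxDF - minDF) / numSplits
for i in range(numSplits):
    lo = int(minDF + i * stride)
    hi = int(minDF + (i + 1) * stride)
    if i == 0:
        print("SELECT ... FROM dbname.tablename WHERE id < {}".format(hi))
    elif i == numSplits - 1:
        print("SELECT ... FROM dbname.tablename WHERE id >= {}".format(lo))
    else:
        print("SELECT ... FROM dbname.tablename WHERE id >= {} AND id < {}".format(lo, hi))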

Another example connection string => if you are on Spark 2.4, see this documentation; it uses somewhat cleaner code:

https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#manage-parallelism

# jdbcUrl and connectionProps must be defined beforehand (see below)
sourceDF = spark.read.jdbc(
  url=jdbcUrl,
  table="dbname.tablename",
  column="id",            # col to split on
  lowerBound=minDF,
  upperBound=maxDF,
  numPartitions=125,
  properties=connectionProps
)
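
The jdbcUrl and connectionProps used above are assumed to be defined elsewhere; a minimal sketch of what they might look like for this MySQL source (host, database name, and credentials are placeholders matching the connection options used earlier):

# Hypothetical connection settings assumed by spark.read.jdbc above;
# replace host, database, and credentials with your own values.
jdbcUrl = "jdbc:mysql://xx.xxx.xx.xxx:3306/dbname"
connectionProps = {
    "user": "xxxxxxxxxxx",
    "password": "xxxxxxxxxxxxxxxxxxxx",
    "driver": "com.mysql.jdbc.Driver",
}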