Adding a postgresql jar via spark-submit on Amazon EMR

Posted: 2016-05-10 06:28:19

Tags: apache-spark amazon pyspark apache-spark-sql emr

I tried spark-submit with --driver-class-path, with --jars, and also the approach described at https://petz2000.wordpress.com/2015/08/18/get-blas-working-with-spark-on-amazon-emr/

as well as setting SPARK_CLASSPATH on the command line, as in

SPARK_CLASSPATH=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar pyspark

I get this error:

Found both spark.executor.extraClassPath and SPARK_CLASSPATH. Use only the former.

but I am unable to add the jar. How can I add the postgresql JDBC jar file so that I can use it in pyspark? I am using EMR release 4.2.

Thanks

4 answers:

Answer 0 (score: 2)

1) Unset the environment variable:

unset SPARK_CLASSPATH

2) Distribute the postgres driver across the cluster with the --jars option:

pyspark --jars=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar
# or
spark-submit --jars=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar <your py script or app jar>
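Once the jar ships with --jars, the driver class is available to the DataFrame JDBC reader. A minimal sketch of the read side (the host, database, table, and credentials below are placeholders, not from the question):

```python
# Placeholder connection details -- substitute your own host, database,
# table, and credentials before running this inside a pyspark session.
jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"

connection_properties = {
    "user": "myuser",
    "password": "mypassword",
    # Driver class provided by the postgresql jar passed via --jars
    "driver": "org.postgresql.Driver",
}

# Inside pyspark (where `spark` is the SparkSession):
# df = spark.read.jdbc(url=jdbc_url, table="my_table",
#                      properties=connection_properties)
```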

Answer 1 (score: 2)

Adding the jar path to the spark.driver.extraClassPath line in /etc/spark/conf/spark-defaults.conf solved the problem for me.
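A sketch of what the edited line in /etc/spark/conf/spark-defaults.conf might look like; the pre-existing classpath entries vary by EMR release (the `/usr/lib/hadoop/*` entry here is only illustrative), and the jar path is appended after them:

```
spark.driver.extraClassPath    /usr/lib/hadoop/*:/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar
```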

Answer 2 (score: 1)

I usually use the approach below, which works well.

Step 1: Download the postgres driver jar with a bootstrap action shell script:

#!/bin/bash

mkdir -p /home/hadoop/lib/
cd /home/hadoop/lib

wget https://jdbc.postgresql.org/download/postgresql-42.2.12.jar
chmod +x postgresql-42.2.12.jar

Step 2: Add a configuration JSON at EMR cluster setup to include the jar file in the executor and driver extraClassPath entries of the spark-defaults file:

{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.executor.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/lib/postgresql-42.2.12.jar",
        "spark.driver.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/lib/postgresql-42.2.12.jar"
    }
}
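EMR rejects configuration JSON that is not strictly valid (for example, trailing commas), so it is worth sanity-checking the file before cluster launch. A quick check in Python (the classpath below is abbreviated for readability; use the full value from the answer above):

```python
import json

# Abbreviated version of the spark-defaults classification above;
# the real classpath values are much longer.
config = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/home/hadoop/lib/postgresql-42.2.12.jar",
            "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/home/hadoop/lib/postgresql-42.2.12.jar",
        },
    }
]

# Round-trip through json to confirm the structure serializes cleanly.
serialized = json.dumps(config, indent=4)
parsed = json.loads(serialized)
print(parsed[0]["Classification"])
```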

Answer 3 (score: 0)

If your EMR cluster has internet access, you can use Maven:

$ spark-sql --packages org.postgresql:postgresql:42.2.18 --driver-class-path ~/.ivy2/jars/org.postgresql_postgresql-42.2.18.jar

where spark-sql can be replaced with pyspark or another Spark CLI.

This downloads the PostgreSQL JDBC driver and its dependencies from Maven Central onto your EMR master node, most likely under /home/hadoop/.ivy2/jars/ (check the Spark console/logs to double-check), and loads that driver into Spark.
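The same Maven coordinate can also be set once in spark-defaults via the spark.jars.packages property, so every session resolves the driver without repeating the flag; a sketch of the line:

```
spark.jars.packages    org.postgresql:postgresql:42.2.18
```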