I have tried spark-submit with --driver-class-path, tried --jars, and tried the approach from https://petz2000.wordpress.com/2015/08/18/get-blas-working-with-spark-on-amazon-emr/
of setting SPARK_CLASSPATH on the command line, like
SPARK_CLASSPATH=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar pyspark
I get this error:
Found both spark.executor.extraClassPath and SPARK_CLASSPATH. Use only the former.
but I can't get the jar added that way. How can I add the PostgreSQL JDBC jar file so I can use it in pyspark? I'm using EMR release 4.2.
Thanks
Answer 0 (score: 2)
1) Unset the environment variable:
unset SPARK_CLASSPATH
2) Distribute the Postgres driver across the cluster with the --jars option:
pyspark --jars=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar
# or
spark-submit --jars=/home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar <your py script or app jar>
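Once the jar is on the classpath this way, a DataFrame read over JDBC looks like the sketch below. The host, database, table, and credentials are placeholders, not values from the question, and the read is wrapped in a function so nothing tries to connect until you call it on the cluster:

```python
def jdbc_url(host, db, port=5432):
    """Build a PostgreSQL JDBC URL of the form jdbc:postgresql://host:port/db."""
    return f"jdbc:postgresql://{host}:{port}/{db}"

def read_table(url, table, user, password):
    """Read one table through the PostgreSQL JDBC driver distributed via --jars."""
    # Import deferred so the URL helper is usable without a Spark installation.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.getOrCreate()
    return (spark.read.format("jdbc")
            .option("url", url)
            .option("dbtable", table)
            .option("user", user)            # placeholder credentials
            .option("password", password)
            .option("driver", "org.postgresql.Driver")
            .load())

# Example call (all placeholders):
# df = read_table(jdbc_url("my-pg-host", "mydb"), "public.my_table", "myuser", "secret")
```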
Answer 1 (score: 2)
Adding the jar path to the spark.driver.extraClassPath line in /etc/spark/conf/spark-defaults.conf solved my problem.
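For reference, the relevant line in /etc/spark/conf/spark-defaults.conf might look like this, using the jar path from the question (if the property already has a value, append the jar with a colon separator rather than replacing the existing entries):

```
spark.driver.extraClassPath  /home/hadoop/pg_jars/postgresql-9.4.1208.jre7.jar
```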
Answer 2 (score: 1)
I usually use the approach below, and it works well.
1) Step 1: download the Postgres driver jar with a bootstrap action shell script:
#!/bin/bash
mkdir -p /home/hadoop/lib/
cd /home/hadoop/lib
wget https://jdbc.postgresql.org/download/postgresql-42.2.12.jar
chmod +x postgresql-42.2.12.jar
2) Step 2: at EMR cluster setup, add a configuration JSON that appends the jar file to the executor and driver extraClassPath entries in the spark-defaults file:
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.executor.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/lib/postgresql-42.2.12.jar",
    "spark.driver.extraClassPath": ":/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/lib/postgresql-42.2.12.jar"
  }
}
Answer 3 (score: 0)
If your EMR cluster has internet access, you can use Maven:
$ spark-sql --packages org.postgresql:postgresql:42.2.18 --driver-class-path ~/.ivy2/jars/org.postgresql_postgresql-42.2.18.jar
where spark-sql can be replaced with pyspark or another Spark CLI.
This downloads the PostgreSQL JDBC driver and its dependencies from Maven Central to your EMR master node, most likely into /home/hadoop/.ivy2/jars/ (you can double-check in the Spark console/logs), and loads that driver into Spark.
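To confirm what actually landed in the Ivy cache, you can list that directory; the path below assumes the default Ivy settings (a sketch, not EMR-specific):

```python
from pathlib import Path

# Default cache directory that --packages resolves jars into
# (assumption: stock Ivy configuration, no custom spark.jars.ivy).
ivy_jars = Path.home() / ".ivy2" / "jars"

if ivy_jars.is_dir():
    # Print every resolved jar, e.g. org.postgresql_postgresql-42.2.18.jar.
    for jar in sorted(ivy_jars.glob("*.jar")):
        print(jar.name)
else:
    print(f"{ivy_jars} does not exist yet; run a --packages command first")
```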