I am connecting Hive and loading data into Elasticsearch (v6.2) with the snippet below, without any issues:
ADD JAR file:///<>/elasticsearch-hadoop-hive-6.2.2.jar;
ADD FILE file:///<>/mycerts.jks;
CREATE EXTERNAL TABLE if not exists my_db.my_es_table
(
col1 int,
col2 string,
col3 string,
col4 timestamp,
key_id string
)
COMMENT 'data into ES'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'index1/type1',
'es.index.auto.create'='true',
'es.nodes'='<vip_name>',
'es.port'='9200',
'es.net.http.auth.user'='<user>',
'es.net.http.auth.pass'='<pwd>',
'es.net.ssl.protocol'='SSL',
'es.net.ssl'='TRUE',
'es.net.ssl.truststore.location'='mycerts.jks',
'es.net.ssl.truststore.pass'='<pwd>',
'es.mapping.id'='key_id'
);
INSERT OVERWRITE TABLE my_db.my_es_table
SELECT
col1,
col2,
col3,
col4,
key_id
FROM my_db.stagging_data;
However, when I try to migrate the same block to PySpark, it throws an exception:
org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Expected to find keystore file at [file:///<path>/mycerts.jks] but was unable to. Make sure that it is available on the classpath, or if not, that you have specified a valid URI
Below is the code snippet I am trying in PySpark:
df_delta = sqlContext.table('my_db.stagging_data')
status = df_delta.rdd.map(lambda row: (None, row.asDict())).saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        'es.resource': 'index1/type1',
        'es.index.auto.create': 'true',
        'es.nodes': '<vip_name>',
        'es.port': '9200',
        'es.net.http.auth.user': '<user>',
        'es.net.http.auth.pass': '<pwd>',
        'es.net.ssl': 'true',
        'es.net.ssl.truststore.location': 'file:///<path>/mycerts.jks',
        'es.net.ssl.truststore.pass': '<pwd>',
        'es.mapping.id': 'key_id'
    })
I am launching the shell with the following command:
pyspark --jars <path>/elasticsearch-spark-20_2.11-6.2.2.jar --py-files <path>/mycerts.jks
I am adding the full log below:
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Expected to find keystore file at [file:///<path>/mycerts.jks] but was unable to. Make sure that it is available on the classpath, or if not, that you have specified a valid URI.
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.loadKeyStore(SSLSocketFactory.java:193)
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.loadTrustManagers(SSLSocketFactory.java:224)
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.createSSLContext(SSLSocketFactory.java:171)
... 31 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1609)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1597)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1596)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1596)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1830)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1779)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1768)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
at org.apache.spark.internal.io.SparkHadoopWriter$.write(SparkHadoopWriter.scala:78)
... 26 more
Caused by: org.apache.spark.SparkException: Task failed while writing rows
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalStateException: Cannot initialize SSL - Expected to find keystore file at [file:///<path>/mycerts.jks] but was unable to. Make sure that it is available on the classpath, or if not, that you have specified a valid URI.
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.createSSLContext(SSLSocketFactory.java:173)
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.getSSLContext(SSLSocketFactory.java:158)
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.createSocket(SSLSocketFactory.java:127)
at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport.execute(CommonsHttpTransport.java:478)
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:112)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:380)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:344)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:348)
at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:158)
at org.elasticsearch.hadoop.rest.RestClient.getHttpNodes(RestClient.java:115)
at org.elasticsearch.hadoop.rest.InitializationUtils.discoverNodesIfNeeded(InitializationUtils.java:92)
at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:579)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.init(EsOutputFormat.java:173)
at org.elasticsearch.hadoop.mr.EsOutputFormat$EsRecordWriter.write(EsOutputFormat.java:149)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1413)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
... 8 more
Caused by: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Expected to find keystore file at [file:///<path>/mycerts.jks] but was unable to. Make sure that it is available on the classpath, or if not, that you have specified a valid URI.
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.loadKeyStore(SSLSocketFactory.java:193)
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.loadTrustManagers(SSLSocketFactory.java:224)
at org.elasticsearch.hadoop.rest.commonshttp.SSLSocketFactory.createSSLContext(SSLSocketFactory.java:171)
... 31 more
After connecting to PySpark I can read and print the jks file, but I cannot resolve this issue. Can someone please advise?
Answer 0 (score: 3)
I think you are using the wrong option.

For Python, you can use spark-submit's --py-files argument to add .py, .zip or .egg files to be distributed with your application.

Instead, you want --files:
--files FILES: Comma-separated list of files to be placed in the working directory of each executor. File paths of these files in executors can be accessed via SparkFiles.get(fileName)
To place it at a different path within the executors, you can use a # separator:
spark-submit ... --files mycerts.jks#/<path>/mycerts.jks
Inside the code, you can get a reference to the path from SparkFiles.get("mycerts.jks"), which returns the absolute path to the file.
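For example, a minimal sketch of how this could look in the asker's job (untested; the <vip_name>, <pwd> and index placeholders are copied from the question):

from pyspark import SparkFiles

# shell started with --files instead of --py-files, e.g.:
# pyspark --jars <path>/elasticsearch-spark-20_2.11-6.2.2.jar --files <path>/mycerts.jks

jks_path = SparkFiles.get("mycerts.jks")  # absolute local path of the shipped truststore

df_delta = sqlContext.table('my_db.stagging_data')
df_delta.rdd.map(lambda row: (None, row.asDict())).saveAsNewAPIHadoopFile(
    path='-',
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf={
        'es.resource': 'index1/type1',
        'es.nodes': '<vip_name>',
        'es.port': '9200',
        'es.net.ssl': 'true',
        # point at the resolved local copy rather than a hard-coded file:/// URI
        'es.net.ssl.truststore.location': 'file://' + jks_path,
        'es.net.ssl.truststore.pass': '<pwd>',
        'es.mapping.id': 'key_id'
    })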
Answer 1 (score: 0)
Based on your stack trace, it seems your <path> is clearly wrong. The path values you pass after --jars and --py-files need to be exact, and they currently are not:

Expected to find keystore file at [file://///mycerts.jks] but was unable to. Make sure that it is available on the classpath, or if not, that you have specified a valid URI.
Answer 2 (score: 0)
As cricket_007 pointed out above, you are using the incorrect option --py-files. Use the --files option instead to upload your cert file.

Furthermore, the files are not uploaded to the executors' local file system but onto HDFS, so the path you are passing for the cert file in your Spark code is also incorrect, as it points to the local file system:
'es.net.ssl.truststore.location':'file:///<path>/mycerts.jks'
You can use the # separator with the --files option in the spark-submit command to upload the file to the specified path:
spark-submit ... --files mycerts.jks#/<path>/mycerts.jks
and then use the same path in your Spark code to access the file:
'es.net.ssl.truststore.location':'/<path>/mycerts.jks'
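Putting both changes together, the shell invocation from the question would become something like this (a sketch only, not verified; the <path> placeholders are the same ones used in the question):

pyspark --jars <path>/elasticsearch-spark-20_2.11-6.2.2.jar --files <path>/mycerts.jks#/<path>/mycerts.jks

with the conf entry above ('/<path>/mycerts.jks', no file:/// scheme) left unchanged in the saveAsNewAPIHadoopFile call.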