Spark on Glue cannot connect to AWS / Elasticsearch

Asked: 2020-04-27 07:16:24

Tags: apache-spark elasticsearch aws-glue

I am running Spark inside Glue and writing to AWS / Elasticsearch with the following Spark configuration:

  conf.set("es.nodes", s"$nodes/$indexName")
  conf.set("es.port", "443")
  conf.set("es.batch.write.retry.count", "200")
  conf.set("es.batch.size.bytes", "512kb")
  conf.set("es.batch.size.entries", "500")
  conf.set("es.index.auto.create", "false")
  conf.set("es.nodes.wan.only", "true")
  conf.set("es.net.ssl", "true")

But I get the following error:

diagnostics: User class threw exception: org.elasticsearch.hadoop.EsHadoopIllegalArgumentException: Cannot detect ES version - typically this happens if the network/Elasticsearch cluster is not accessible or when targeting a WAN/Cloud instance without the proper setting 'es.nodes.wan.only'
    at org.elasticsearch.hadoop.rest.InitializationUtils.discoverClusterInfo(InitializationUtils.java:340)
    at org.elasticsearch.spark.rdd.EsSpark$.doSaveToEs(EsSpark.scala:104)
    ....

I know which VPC my Elasticsearch instance runs in, but I am not sure how to set that for Glue / Spark, or whether the problem lies elsewhere. Any ideas?

I also tried adding a Glue JDBC connection, which should use the correct VPC, but I am not sure how to set it up correctly:

  import scala.reflect.runtime.universe._
  def saveToEs[T <: Product : TypeTag](index: String, data: RDD[T]) =
    SparkProvider.glueContext.getJDBCSink(
      catalogConnection = "my-elasticsearch-connection",
      options = JsonOptions(
        "WHAT HERE?"
      ),
      transformationContext = "SinkToElasticSearch"
    ).writeDynamicFrame(DynamicFrame(
      SparkProvider.sqlContext.createDataFrame[T](data),
      SparkProvider.glueContext))

1 Answer:

Answer 0: (score: 0)

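Try creating a dummy JDBC connection. The dummy connection tells Glue the VPC, subnet and security group of the ES domain. "Test connection" may not work, but when you run the job with that connection attached, Glue uses the connection metadata to launch elastic network interfaces in the VPC, which makes this communication possible. For more information on connections, see: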

[1] https://docs.aws.amazon.com/glue/latest/dg/start-connecting.html
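
As a rough illustration of the dummy connection described above, here is a minimal sketch in plain Scala that only builds the JSON body one could pass to `aws glue create-connection --connection-input file://...`. The field names follow the Glue `ConnectionInput` structure as I understand it. The object name, the helper names, the JDBC URL, the credentials and all IDs are hypothetical placeholders, not values from the question. The JDBC URL is never actually used to reach a database here; only the subnet and security groups matter.

```scala
// Sketch only: builds the JSON for a dummy Glue JDBC connection.
// Every ID, name and URL below is a hypothetical placeholder.
object DummyGlueConnection {

  // Network settings of the VPC that hosts the Elasticsearch domain.
  final case class VpcSettings(
      subnetId: String,
      securityGroupIds: Seq[String],
      availabilityZone: String
  )

  // Minimal JSON string quoting (backslashes and double quotes only).
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Renders a ConnectionInput document for `aws glue create-connection`.
  def connectionInputJson(name: String, vpc: VpcSettings): String = {
    val groups = vpc.securityGroupIds.map(quote).mkString("[", ", ", "]")
    s"""{
       |  "Name": ${quote(name)},
       |  "ConnectionType": "JDBC",
       |  "ConnectionProperties": {
       |    "JDBC_CONNECTION_URL": "jdbc:mysql://dummy-host:3306/dummy",
       |    "USERNAME": "dummy",
       |    "PASSWORD": "dummy"
       |  },
       |  "PhysicalConnectionRequirements": {
       |    "SubnetId": ${quote(vpc.subnetId)},
       |    "SecurityGroupIdList": $groups,
       |    "AvailabilityZone": ${quote(vpc.availabilityZone)}
       |  }
       |}""".stripMargin
  }

  def main(args: Array[String]): Unit = {
    val vpc = VpcSettings(
      subnetId = "subnet-0123456789abcdef0",      // placeholder
      securityGroupIds = Seq("sg-0123456789abcdef0"), // placeholder
      availabilityZone = "us-east-1a"             // placeholder
    )
    println(connectionInputJson("my-elasticsearch-connection", vpc))
  }
}
```

The connection then has to be attached to the Glue job itself; defining it is not enough. It is also worth checking that the Elasticsearch domain's security group allows inbound traffic on port 443 from the security group used by the connection.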