I have an Elasticsearch index with nearly 320 million documents, 68 GB in size, split into 5 shards.
What I want is to read the whole index from Spark and convert it to Parquet. However, the data is too large to fit in memory, so the following exception is thrown:
ERROR NetworkClient: Node [127.0.0.1:9200] failed (Read timed out); no other nodes left - aborting...
ERROR Utils: Aborting task
org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[127.0.0.1:9200]]
at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:149)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:466)
at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:450)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:391)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$DynamicPartitionWriteTask.execute(FileFormatWriter.scala:365)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Given that, I tried setting the scroll.limit property to 1000 so that the data would be read in chunks of 1000 documents, but the same exception is thrown. Looking at the official documentation, I came across "sliced scroll", where you have to manage the scroll_id in order to fetch the next batch. Correct me if I am wrong, but in theory Spark should loop over the data batch by batch until there is nothing left. However, I cannot find out how to do that with Spark.
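For context, this is roughly what managing the scroll_id by hand looks like against the plain REST API, outside of Spark. It is only a rough sketch: the batch size of 1000 and the 5m keep-alive are arbitrary, and the scroll_id is pulled out with a regex instead of a proper JSON parser:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import scala.io.Source

object ManualScroll {
  // minimal HTTP POST helper, just enough for this sketch
  def post(url: String, body: String): String = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(body.getBytes(StandardCharsets.UTF_8))
    out.close()
    val response = Source.fromInputStream(conn.getInputStream, "UTF-8").mkString
    conn.disconnect()
    response
  }

  // crude scroll_id extraction; a real client would parse the JSON properly
  def scrollId(json: String): String =
    """"_scroll_id"\s*:\s*"([^"]+)"""".r.findFirstMatchIn(json).map(_.group(1)).getOrElse("")

  def main(args: Array[String]): Unit = {
    // open the scroll and fetch the first batch of 1000 documents
    var response = post("http://localhost:9200/my-index/_search?scroll=5m",
      """{"size": 1000, "query": {"match_all": {}}}""")
    // keep asking for the next batch with the returned scroll_id until a batch comes back empty
    while (response.contains(""""hits":[{""")) {
      // ... process the current batch here ...
      response = post("http://localhost:9200/_search/scroll",
        s"""{"scroll": "5m", "scroll_id": "${scrollId(response)}"}""")
    }
    // a real implementation would also delete the scroll context when done
  }
}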
I worked around the problem by filtering (pushing down) the data manually, which reduces the amount of data requested from Elasticsearch. I use a timestamp to bound the response of the query, so I had to query Elasticsearch several times to read the whole index. Basically, I sliced the scroll by hand. As you can see, this is not the best way to solve the problem. So, do you have any suggestions on how I can read the whole dataset automatically?
Note that both Elasticsearch and Spark run on my local machine (16 GB of RAM and 4 cores). Here are my code and dependencies:
Code with the scroll limit (fails)
val sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("ElasticSearch to Parquet")
.set("es.nodes", "localhost")
.set("es.port", "9200")
.set("es.index.auto.create", "false")
.set("es.nodes.wan.only", "false")
val sparkSession = SparkSession
.builder
.config(sparkConf)
.getOrCreate()
val df = sparkSession.sqlContext.read
.format("org.elasticsearch.spark.sql")
.option("scroll.limit", 1000)
.load("my-index/index")
df.write.format("parquet").mode("append").save("data/data.parquet")
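For reference, if I read the configuration docs correctly, es.scroll.size (not scroll.limit) is the setting that controls how many documents each scroll request returns, while es.scroll.limit caps the total number of documents read per scroll. A sketch of that variant, reusing the same SparkSession as above:

val dfScrollSize = sparkSession.sqlContext.read
  .format("org.elasticsearch.spark.sql")
  // number of documents fetched per scroll request by each task
  .option("es.scroll.size", "1000")
  .load("my-index/index")
dfScrollSize.write.format("parquet").mode("append").save("data/data.parquet")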
Code with pushdown filtering (the query is repeated as many times as needed, changing the start and end timestamps)
val sparkConf = new SparkConf()
.setMaster("local[*]")
.setAppName("ElasticSearch to Parquet")
.set("es.nodes", "localhost")
.set("es.port", "9200")
.set("es.index.auto.create", "false")
.set("es.nodes.wan.only", "false")
val sparkSession = SparkSession
.builder
.config(sparkConf)
.getOrCreate()
val df = sparkSession.sqlContext.read
.format("org.elasticsearch.spark.sql")
.load("my-index/index")
val filter = df.filter(df("timestamp").gt("dateStart").and(df("timestamp").lt("dateEnd")))
filter.write.format("parquet").mode("append").save("data/data.parquet")
pom.xml
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.11.8</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.1.0</version>
</dependency>
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-spark-20_2.11</artifactId>
<version>6.0.0</version>
</dependency>
</dependencies>
Answer 0 (score: 0)
This looks like a network issue: Node [127.0.0.1:9200] failed (Read timed out); no other nodes left - aborting...
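As a first step, the read timeout itself can be relaxed through the connector's HTTP settings (these keys exist in elasticsearch-hadoop; the values below are only examples), for instance by extending the SparkConf from the question:

val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("ElasticSearch to Parquet")
  .set("es.nodes", "localhost")
  .set("es.port", "9200")
  // allow slow scroll requests more time before the node is marked as failed
  .set("es.http.timeout", "5m")
  .set("es.http.retries", "5")
  // keep the scroll context alive longer between consecutive requests
  .set("es.scroll.keepalive", "10m")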
Let me explain how Spark (that is, Spark plus the elasticsearch-hadoop library) works with Elasticsearch: the Elasticsearch shards, which are reachable over HTTP on the data nodes (if HTTP is enabled), are mapped to the partitions of the RDD.
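You can see this mapping directly from the DataFrame in your question: with the default settings the connector creates one Spark partition per Elasticsearch shard, so for a 5-shard index this should print 5:

val df = sparkSession.sqlContext.read
  .format("org.elasticsearch.spark.sql")
  .load("my-index/index")
// one partition per shard by default, so 5 partitions for this index
println(df.rdd.getNumPartitions)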
Since you set es.nodes.wan.only to false, Spark will first issue GET /_cat/nodes to discover the IP addresses of the Elasticsearch cluster nodes, and then map IP <-> shard.
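You can look at the same information by querying that endpoint yourself:

// node discovery data as exposed by the cluster (the endpoint mentioned above)
println(scala.io.Source.fromURL("http://localhost:9200/_cat/nodes?v").mkString)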
So when Spark pulls the data, each Spark worker task can talk to any Elasticsearch shard, wherever it sits in the cluster. That is why colocation is usually a very good idea (running the Spark workers and Elasticsearch on the same machines), following the big-data principle of moving the computation as close to the data as possible.
Also, the most important tuning point here is: have as many shards as your Spark workers have CPUs. (Though since Elasticsearch 6 and its sliced scroll feature, this is no longer strictly the case.)
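Concretely, with the 6.0.0 connector against a cluster that supports sliced scroll, the setting to look at (rather than slicing by timestamp manually) should be es.input.max.docs.per.partition, which lets the connector split each shard into several smaller partitions; combined with a modest es.scroll.size, each task then pulls the data in small batches. A sketch, with example values only:

val df = sparkSession.sqlContext.read
  .format("org.elasticsearch.spark.sql")
  // split each shard into slices of at most ~1M documents (sliced scroll under the hood)
  .option("es.input.max.docs.per.partition", "1000000")
  // documents fetched per scroll request by each task
  .option("es.scroll.size", "1000")
  .load("my-index/index")
df.write.format("parquet").mode("append").save("data/data.parquet")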