Question

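I'm using Scala 2.12 with Elasticsearch 5.2.2. My requirement is simply to fetch/search documents that match a condition. A single search will return more than 10,000 documents or messages, so I can't use a regular search. Each document/message is a complex JSON that I can parse later, so I need to fetch all such messages and store them in a single JSON list or some other collection. I'm not very proficient in Scala; I can run searches from Scala using Elastic4s. I see that it has scroll and scan options, but I haven't found a complete working example, so I'm asking for help.

I've seen some sample code like the following, but I need more help fetching everything and putting it into a collection as described above: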

client.execute {
   search in "index" / "type" query <yourquery> scroll "1m"
}

client.execute {
   search scroll <id>
}

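But how do I get the scroll ID, and how do I keep going until I have all the data?

Update:

The Scala and ES versions are mentioned above.

I'm working from the following example:

SBT: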

libraryDependencies += "com.sksamuel.elastic4s" %% "elastic4s-core" % "7.0.2"

libraryDependencies += "com.sksamuel.elastic4s" %% "elastic4s-http" % "5.5.10"

libraryDependencies += "com.sksamuel.elastic4s" %% "elastic4s-http-streams" % "6.5.1"

libraryDependencies += "org.elasticsearch" % "elasticsearch" % "5.6.0"

Code:

import com.sksamuel.elastic4s.ElasticsearchClientUri
import com.sksamuel.elastic4s.requests.common.RefreshPolicy
import com.sksamuel.elastic4s.http.{ElasticClient, ElasticProperties}
import com.sksamuel.elastic4s.http.Response
import com.sksamuel.elastic4s.http.search.SearchResponse
import com.sksamuel.elastic4s.HttpClient

import com.sksamuel.elastic4s.http.ElasticDsl._

val client = HttpClient(ElasticsearchClientUri("host", 9200))

val resp1 = client.execute {
     search("index")
       .matchQuery("key", "value")
       .scroll("1m")
       .limit(500)
   }.await.result

val resp2 = client.execute {
      searchScroll(resp1.scrollId.get).keepAlive(1.minute)
    }.await

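I don't think I'm using the right versions of the Elastic4s modules.

Problems:

import com.sksamuel.elastic4s.HttpClient: the HttpClient class isn't recognized. Because of that error, HttpClient can't be found when I try to initialize the "client" variable.
Next, in resp2, when I try to get "scrollId", it isn't recognized. How do I get the scrollId from resp1?

Basically, what's missing here?

Update 2:

I changed the version of the following dependency based on the example on GitHub (example):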

libraryDependencies += "com.sksamuel.elastic4s" %% "elastic4s-http" % "6.3.3"

Code:

val client = ElasticClient(ElasticProperties("http://host:9200"))

Now I get the following error:

Error:

Symbol 'type <none>.term.BuildableTermsQuery' is missing from the classpath.
[error] This symbol is required by 'method com.sksamuel.elastic4s.http.search.SearchHandlers.BuildableTermsNoOp'.
[error] Make sure that type BuildableTermsQuery is in your classpath and check for conflicting dependencies with `-Ylog-classpath`.
[error] A full rebuild may help if 'SearchHandlers.class' was compiled against an incompatible version of <none>.term.
[error]     val client = ElasticClient(ElasticProperties("host:9200"))
[error]                                                 ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed

Answer 1

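Personally, I use Akka Streams for this type of workflow, because it makes parallel processing and building the workflow easier.

The reference documentation can be a bit dense, but the basic idea is that you start with a Source, which has one output, push it through any number of Flows, and then collect the results in a Sink.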
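To make the Source, Flow, Sink idea concrete, here is a minimal sketch that has nothing to do with Elasticsearch yet. It assumes akka-stream is on the classpath (a 2.5-style setup with ActorMaterializer); the object name, the system name and the two mapping steps are made up for illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical example object, not part of any library.
object SourceFlowSink {
  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("source-flow-sink")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val numbers  = Source(1 to 5)                  // Source: one output
      val doubled  = Flow[Int].map(_ * 2)            // Flow: one input, one output
      val labelled = Flow[Int].map(n => s"item-$n")  // a second Flow chained after the first
      // Sink.seq collects every element into a Future of a sequence
      val done = numbers.via(doubled).via(labelled).runWith(Sink.seq)
      Await.result(done, 10.seconds)
    } finally system.terminate()
  }
}
```

Blocking with Await is only there to keep the sketch short; in a real application you would normally keep working with the Future.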

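Elastic4s supports this (almost) natively, so you don't have to deal with scrolling and so on directly.

Now, I don't know how you want to process your records, but creating a Source for your data would look something like this: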
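For reference, the scrolling that gets hidden boils down to a loop: run the first search with a scroll timeout, read the scroll ID from the response, and keep requesting the next page with that ID until a page comes back empty. The sketch below shows only that control flow using the standard library. Page, fetchAll and the in-memory fake in demo are hypothetical stand-ins, not Elastic4s types.

```scala
import scala.annotation.tailrec

// Hypothetical sketch of the scroll loop; none of these names come from Elastic4s.
object ScrollSketch {
  // One page of results plus the scroll id to use for the next request.
  final case class Page[A](hits: Seq[A], scrollId: Option[String])

  // Requests pages until one is empty (or carries no scroll id) and accumulates all hits.
  def fetchAll[A](first: () => Page[A], next: String => Page[A]): Vector[A] = {
    @tailrec
    def loop(page: Page[A], acc: Vector[A]): Vector[A] =
      if (page.hits.isEmpty) acc
      else page.scrollId match {
        case Some(id) => loop(next(id), acc ++ page.hits)
        case None     => acc ++ page.hits
      }
    loop(first(), Vector.empty)
  }

  // In-memory fake standing in for the cluster: 7 documents served 3 at a time.
  def demo: Vector[String] = {
    val docs = (1 to 7).map(i => s"doc-$i")
    def pageAt(offset: Int): Page[String] =
      Page(docs.slice(offset, offset + 3), Some((offset + 3).toString))
    fetchAll(() => pageAt(0), id => pageAt(id.toInt))
  }
}
```

With Elastic4s, first would presumably wrap the initial search(...).scroll("1m") request and next a searchScroll(id) request, as in the question's code; how the scroll ID and hits are read off the response depends on the Elastic4s version.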

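import akka.actor.ActorRefFactory
import akka.stream.scaladsl.{GraphDSL, Keep, Sink, Source}
import com.sksamuel.elastic4s.http.ElasticClient
import com.sksamuel.elastic4s.http.ElasticDsl._
// Assumed location of the implicits that add client.publisher (elastic4s-http-streams, 6.x); check your version
import com.sksamuel.elastic4s.streams.ReactiveElastic._

class MyIndexer(indexName: String) {
  // Wraps the Elastic4s scroll publisher in an Akka Streams Source of SearchHit
  def getIndexSource(client: ElasticClient)(implicit actorRefFactory: ActorRefFactory) = Source.fromPublisher(
    client.publisher(search(indexName) query (your-query-here) sortByFieldAsc "originalSource.ctime" scroll "5m")
  )
}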

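Calling MyIndexer.getIndexSource gives you back a Source[SearchHit]. You can then convert each SearchHit into a domain object the way you would normally handle Elastic4s results (in my case, using Circe's generic.auto; just as with the non-streaming interface, you can use .to[Domainobject]).

You may be wondering about the implicit ActorRefFactory; that is the Akka ActorSystem. If you're working with the Play framework, you get it for free by using dependency injection to request an ActorSystem instance in any injected class (i.e. class MyClass @Inject() (implicit sys:ActorSystem)). If you're using plain Scala, you can do the following in your Main function: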
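The pieces used here live in separate Elastic4s modules, and the "missing from the classpath" error in the question is typical of mixing module versions. A build.sbt sketch that keeps them aligned might look like this. It assumes version 6.3.3 (the one from the question's second update) and the module names elastic4s-http-streams and elastic4s-circe; whether that major version works against an Elasticsearch 5.2.2 cluster is something to verify separately.

```scala
// build.sbt sketch: every Elastic4s module pinned to one shared version.
// 6.3.3 is an assumption taken from the question; pick the release that matches your cluster.
val elastic4sVersion = "6.3.3"

libraryDependencies ++= Seq(
  "com.sksamuel.elastic4s" %% "elastic4s-core"         % elastic4sVersion,
  "com.sksamuel.elastic4s" %% "elastic4s-http"         % elastic4sVersion,
  "com.sksamuel.elastic4s" %% "elastic4s-http-streams" % elastic4sVersion, // provides client.publisher
  "com.sksamuel.elastic4s" %% "elastic4s-circe"        % elastic4sVersion  // Circe-based decoding for .to[T]
)
```

Akka Streams itself (akka-stream) would need to be added as well, at a version compatible with the chosen Elastic4s release.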

  import akka.actor.ActorSystem
  import akka.stream.{ActorMaterializer, Materializer}
  private implicit val actorSystem: ActorSystem = ActorSystem("some-name-here")
  private implicit val mat: Materializer = ActorMaterializer.create(actorSystem)

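and thread these values through to wherever they're needed using implicit parameters.

An example of how to get a sequence of all the results this way (probably not exactly what you need, given your description, but a good illustration) works like this: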

import com.sksamuel.elastic4s.circe._
import io.circe.generic.auto._

val source = indexer.getIndexSource(esclient)
val resultFuture = source
  .log("logger-name-here")
  .map(_.to[Yourdomainobject])
  .toMat(Sink.seq[Yourdomainobject])(Keep.right)
  .run()

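// map/recover on a Future need an implicit ExecutionContext in scope, e.g. actorSystem.dispatcher
resultFuture
  .map(resultSeq => { /* do stuff with resultSeq */ })
  .recover({
      case err: Throwable => { /* handle error */ }
  })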

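Now, if you want to do your processing efficiently, you need to implement it as GraphStages and wire them into the stream. I've been implementing a bunch of scanners that process hundreds of thousands of objects, and each one is just a Main function that sets up and runs a stream that does all the actual processing.

I tend to design the logic as a flowchart, then implement each box of the chart as a separate Akka GraphStage, then join them together, using built-in elements such as Broadcast and Merge to get good parallel processing.

Hope that's useful!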
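As a rough illustration of wiring boxes together with Broadcast and Merge, here is a small fan-out/fan-in sketch. To stay short it uses plain Flow.map steps as stand-ins for custom GraphStages; the object name and the two branch functions are invented for the example, and akka-stream (2.5-style, with ActorMaterializer) is assumed to be available.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, FlowShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical example object, not part of any library.
object FanOutFanIn {
  // Each element is sent down both branches; the branch outputs are merged back into one stream.
  val branches: Flow[Int, String, NotUsed] = Flow.fromGraph(GraphDSL.create() { implicit b =>
    import GraphDSL.Implicits._
    val bcast = b.add(Broadcast[Int](2))
    val merge = b.add(Merge[String](2))
    bcast.out(0) ~> Flow[Int].map(n => s"double:${n * 2}") ~> merge.in(0)
    bcast.out(1) ~> Flow[Int].map(n => s"square:${n * n}") ~> merge.in(1)
    FlowShape(bcast.in, merge.out)
  })

  def run(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("fan-out-fan-in")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try {
      val done = Source(1 to 3).via(branches).runWith(Sink.seq)
      // Merge gives no ordering guarantee across branches, so sort before returning
      Await.result(done, 10.seconds).sorted
    } finally system.terminate()
  }
}
```

In a real pipeline the two map steps would be replaced by the processing stages from your flowchart, and the Source would be the one returned by getIndexSource.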

Scala: fetching more than 10,000 documents/messages from Elasticsearch