Question

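I am running into a problem when trying to index multiple documents into Solr with SolrJ from Spark Streaming. For every micro-batch, I parse each record and index it.

In the code below, the first approach (labelled method 1) works as expected. The second approach (labelled method 2) does nothing, and it does not fail either.

In the first approach I index a single document per partition; useless, but it works. In the second approach I convert each element of the partition into a document and then try to index each one, but this fails: no records show up in the collection.

I am using solrj 4.10 and spark-2.2.1.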

//method 1
myDStream.foreachRDD { rdd => rdd.foreachPartition { records =>

  val solrServer = new HttpSolrServer(collectionUrl)

  val document = new SolrInputDocument()
  document.addField("key", "someValue")
  ...

  solrServer.add(document)
  solrServer.commit()
}}

//method 2
myDStream.foreachRDD { rdd => rdd.foreachPartition { records =>

  val solrServer = new HttpSolrServer(collectionUrl)

  records.map { record =>

    val document = new SolrInputDocument()
    document.addField("key", record.key)
    ...

    solrServer.add(document)
    solrServer.commit()
  }
}}

I would like to understand why the second approach does not work, and to find a way to index multiple documents.

Answer 1

The solution was to process the records directly on the RDD, with rdd.foreach instead of rdd.foreachPartition:

myDStream.foreachRDD { rdd => rdd.foreach { record =>

  val solrServer = new HttpSolrServer(collectionUrl)

  val document = new SolrInputDocument()
  document.addField("key", record.key)
  ...

  solrServer.add(document)
  solrServer.commit()
}}
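The version above creates a client and issues a commit for every record. A possible alternative, sketched here under stated assumptions rather than taken from the original answer, keeps the per-partition structure of method 2 but consumes the iterator eagerly and sends one batch per partition. `Record` and `indexStream` are hypothetical names introduced for illustration, only the `key` field from the question is populated, and the sketch assumes a partition's documents fit in memory.

```scala
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument
import org.apache.spark.streaming.dstream.DStream
import scala.collection.JavaConverters._

// Hypothetical record type; the question does not show the real one.
case class Record(key: String)

// Hypothetical helper: indexes every record of every micro-batch.
def indexStream(myDStream: DStream[Record], collectionUrl: String): Unit =
  myDStream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One client per partition, created on the executor.
      val solrServer = new HttpSolrServer(collectionUrl)
      try {
        // toList forces the lazy iterator, so the map body really runs.
        val documents = records.map { record =>
          val document = new SolrInputDocument()
          document.addField("key", record.key)
          document
        }.toList

        // Skip empty partitions instead of sending an empty update.
        if (documents.nonEmpty) {
          solrServer.add(documents.asJava) // one request for the whole batch
          solrServer.commit()              // one commit per partition
        }
      } finally {
        solrServer.shutdown() // release the underlying HTTP resources
      }
    }
  }
```

Whether committing once per partition is appropriate depends on how quickly documents must become searchable; relying on Solr's autoCommit settings instead of explicit commits is another option.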

For more information on the suspected source of the problem, see EricLavault's comment above.
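The comment itself is not reproduced here. One explanation that matches the symptoms of method 2 (no error, no documents) is that `foreachPartition` hands the function an `Iterator`, and `Iterator.map` in Scala is lazy: the function body only runs when the resulting iterator is consumed, and method 2 discards it. The following minimal sketch illustrates this with plain Scala and no Spark or Solr dependency; `LazyIteratorDemo`, `runWithMap`, and `runWithForeach` are hypothetical names, and the counter stands in for the `solrServer.add` call.

```scala
object LazyIteratorDemo {
  // Mirrors method 2: map over the iterator and discard the result.
  def runWithMap(records: Iterator[String]): Int = {
    var indexed = 0
    // The new iterator is never consumed, so this body never executes.
    records.map { record => indexed += 1; record }
    indexed
  }

  // Same body, but foreach consumes the iterator eagerly.
  def runWithForeach(records: Iterator[String]): Int = {
    var indexed = 0
    records.foreach { _ => indexed += 1 }
    indexed
  }

  def main(args: Array[String]): Unit = {
    println(runWithMap(Iterator("a", "b", "c")))     // prints 0
    println(runWithForeach(Iterator("a", "b", "c"))) // prints 3
  }
}
```

If this is indeed the cause, replacing `records.map` with `records.foreach` in method 2 would likely be the smallest change that makes it index every record.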
