Question

What is the best way to quickly retrieve a large data set in Solr?

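My index has 10 million records (6 string fields). The query and filter I use narrow the result set down to 2.7 million records, and I want to page through them programmatically and fetch the data for another process.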

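Currently I use SolrJ with cursorMark to fetch 300,000 records at a time. Each query takes 15-20 seconds. Is there any way to improve the speed? Reducing the "chunk" size doesn't seem to help: going from 300,000 down to 50,000 makes each query faster, but there are more of them and the total time is about the same.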

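I think the problem is that Solr has to build the entire 2.7 million record result set and then cut out the requested slice on every call. Combine that with the "size" of the result set and I can understand why it is slow. I'm looking for ideas to speed it up.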

My SolrJ code is below.

Solr version: 4.10.2

SolrQuery query = new SolrQuery();
query.setQuery("*:*");
query.setFilterQueries("text:\"*SEARCH STUFF*\"");
query.setParam("fl","id,srfCode");
query.setStart(0);
query.setRows(300000);
query.setSort("sortId", SolrQuery.ORDER.asc);
query.set("cursorMark", "*");
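
The snippet above issues only the first request. For reference, here is a minimal sketch of the paging loop that cursorMark implies, relying only on the documented rule that paging is finished when Solr returns the same mark it was given. `PageSource`, `Page`, and `fakeSource` are hypothetical stand-ins (not SolrJ types) so that the loop runs without a server. With SolrJ, `fetch` would presumably set cursorMark on the query, call server.query(query), and read QueryResponse.getNextCursorMark().

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of cursorMark paging; PageSource stands in for a real SolrJ call. */
final class CursorPagingSketch {

    /** One page of results plus the cursor to send with the next request. */
    static final class Page {
        final List<String> ids;
        final String nextCursorMark;

        Page(List<String> ids, String nextCursorMark) {
            this.ids = ids;
            this.nextCursorMark = nextCursorMark;
        }
    }

    /** Hypothetical abstraction over "run the query with this cursorMark". */
    interface PageSource {
        Page fetch(String cursorMark, int rows);
    }

    /** Follows the cursor until the returned mark equals the one that was sent. */
    static List<String> fetchAll(PageSource source, int rows) {
        List<String> all = new ArrayList<>();
        String cursorMark = "*"; // "*" starts a new cursor
        while (true) {
            Page page = source.fetch(cursorMark, rows);
            all.addAll(page.ids);
            if (cursorMark.equals(page.nextCursorMark)) {
                break; // unchanged mark: no more results
            }
            cursorMark = page.nextCursorMark;
        }
        return all;
    }

    /** In-memory stand-in for Solr. Real marks are opaque; this fake uses an offset. */
    static PageSource fakeSource(final List<String> sortedIds) {
        return new PageSource() {
            @Override
            public Page fetch(String cursorMark, int rows) {
                int from = "*".equals(cursorMark) ? 0 : Integer.parseInt(cursorMark);
                int to = Math.min(from + rows, sortedIds.size());
                List<String> ids = new ArrayList<>(sortedIds.subList(from, to));
                // Hand back the incoming mark unchanged once a page comes back empty.
                String next = ids.isEmpty() ? cursorMark : String.valueOf(to);
                return new Page(ids, next);
            }
        };
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            ids.add("doc" + i);
        }
        System.out.println(fetchAll(fakeSource(ids), 4).size());
    }
}
```

Note that the loop always makes one extra request at the end, the one that comes back with an unchanged mark.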

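UPDATE: I tried the following in an attempt to "stream" the data out of Solr. Unfortunately, the query itself is still the bottleneck in getting the data. Once I have it, I can process it quickly, but I still need a faster way to get the data.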

package org.search.builder;

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.StreamingResponseCallback;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.junit.Test;

public class SolrStream {

     long startTime = 0;
     long endTime = 0;

      @Test
      public void streaming() throws SolrServerException, IOException, InterruptedException {
        long overallstartTime = System.currentTimeMillis();
        startTime = System.currentTimeMillis();

        HttpSolrServer server = new HttpSolrServer("https://solrserver/solr/indexname");
        SolrQuery tmpQuery = new SolrQuery();
        tmpQuery.setQuery("*:*");
        tmpQuery.setFilterQueries("text:\"*SEARCH STUFF*\"");
        tmpQuery.setParam("fl","id,srfCode");
        tmpQuery.setStart(0);
        tmpQuery.setRows(300000);
        tmpQuery.set("cursorMark", "*");
        //Sort needs to be unique or have tie breakers.  In this case rowId will never be a duplicate
        //If you can have duplicates then you need a tie breaker (sort should include a second column to sort on)
        tmpQuery.setSort("rowId", SolrQuery.ORDER.asc);
        final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
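        // queryAndStreamResponse blocks until the page has been streamed, so enqueue the
        // stop marker here; the callback alone never does so when rows < numFound.
        server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
        tmpQueue.add(new StopDoc());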
        SolrDocument tmpDoc;
        do {
          tmpDoc = tmpQueue.take();
        } while (!(tmpDoc instanceof StopDoc));

        System.out.println("Overall Time: " + (System.currentTimeMillis() - overallstartTime) + " ms");
      }

      private class StopDoc extends SolrDocument {
        // marker to finish queuing
      }

      private class MyCallbackHander extends StreamingResponseCallback {
        private BlockingQueue<SolrDocument> queue;
        private long currentPosition;
        private long numFound;

        public MyCallbackHander(BlockingQueue<SolrDocument> aQueue) {
          queue = aQueue;
        }

        @Override
        public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
          // called before start of streaming
          // probably use for some statistics
          currentPosition = aStart;
          numFound = aNumFound;
          if (numFound == 0) {
            queue.add(new StopDoc());
          }
        }

        @Override
        public void streamSolrDocument(SolrDocument aDoc) {
          currentPosition++;
          if (queue.size() % 50000 == 0)
          {
              System.out.println("adding doc " + currentPosition + " of " + numFound);
              System.out.println("Overall Time: " + (System.currentTimeMillis() - startTime) + " ms");
              startTime = System.currentTimeMillis();

          }
          queue.add(aDoc);
          if (currentPosition == numFound) {
            queue.add(new StopDoc());
          }
        }
      }
}

Answer 1

MatsLindh's suggestion of the export request handler worked great.

Add this requestHandler to solrconfig (if it isn't already there):

  <requestHandler name="/export" class="solr.SearchHandler">
    <lst name="invariants">
      <str name="rq">{!xport}</str>
      <str name="wt">xsort</str>
      <str name="distrib">false</str>
    </lst>

    <arr name="components">
      <str>query</str>
    </arr>
  </requestHandler>

Then call it like this: /export?q=rowId:[1 TO 4000]&fq=text:\"STUFF\"&fl=field1,field2&sort=sortColumn asc

* You need to specify a sort and an fl.
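
When the request is built from code, the spaces, brackets, and quotes in those parameters need URL encoding. A minimal sketch of how that could be done with only the JDK follows; the class and method names (`ExportUrlSketch`, `buildExportUrl`) are made up for illustration, and the base URL is simply the one from the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds an /export request URL with form-encoded query parameters. */
final class ExportUrlSketch {

    /** Form-encodes one parameter name or value as UTF-8. */
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    /** Appends "/export" and the encoded parameters, in insertion order, to the core URL. */
    static String buildExportUrl(String coreUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(coreUrl).append("/export");
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(encode(p.getKey()))
               .append('=')
               .append(encode(p.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Same parameters as the example request; sort and fl are both present.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "rowId:[1 TO 4000]");
        params.put("fq", "text:\"STUFF\"");
        params.put("fl", "field1,field2");
        params.put("sort", "sortColumn asc");
        System.out.println(buildExportUrl("https://solrserver/solr/indexname", params));
    }
}
```

A LinkedHashMap keeps the parameters in the order they were added, which makes the generated URL easier to compare with the hand-written one.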

Now I just need to figure out how to get /export working in a SolrCloud setup.

Thanks!
