Solr performance when querying for many documents

Date: 2013-04-02 07:37:49

Tags: java performance solr lucene

I want Solr to always retrieve all documents found by a search (I know Solr wasn't built for that, but anyway), and I am currently doing it with this code:

    ...
    QueryResponse response = solr.query(query);
    int offset = 0;
    int totalResults = (int) response.getResults().getNumFound();
    List<Article> ret = new ArrayList<Article>(totalResults);
    query.setRows(FETCH_SIZE);
    while(offset < totalResults) {
        //requires an int? wtf?
        query.setStart((int) offset);
        int left = totalResults - offset;
        if(left < FETCH_SIZE) {
            query.setRows(left);
        }
        response = solr.query(query);
        List<Article> current = response.getBeans(Article.class);
        offset += current.size();
        ret.addAll(current);
    }
   ...

This works, but it is rather slow once a query gets more than about 1000 hits (I have read about this here; it is caused by Solr because setting the start offset on every request, for some reason, takes some time). What would be a better (and faster) way to do this?

3 Answers:

Answer 0 (score: 8):

To improve on the suggested answer, you can use a streamed response. This was added specifically for the case where you fetch all results. As you can see in Solr's Jira, the person there wanted to do the same thing as you. This has been implemented in Solr 4.

This is also described in SolrJ's javadoc.

Solr will package up the response and build a complete XML/JSON document before it starts sending it. Your client then has to unpack all of that and hand it to you as a list. By using streaming and parallel processing, which you can do when using this kind of queued approach, performance should improve further.

Yes, you will lose the automatic bean mapping, but since performance is a factor here, I think that is acceptable.

Here is a sample unit test:

import java.io.IOException;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.StreamingResponseCallback;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrDocument;
import org.junit.Test;

public class StreamingTest {

  @Test
  public void streaming() throws SolrServerException, IOException, InterruptedException {
    HttpSolrServer server = new HttpSolrServer("http://your-server");
    SolrQuery tmpQuery = new SolrQuery("your query");
    tmpQuery.setRows(Integer.MAX_VALUE);
    final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
    server.queryAndStreamResponse(tmpQuery, new MyCallbackHander(tmpQueue));
    SolrDocument tmpDoc;
    do {
      tmpDoc = tmpQueue.take();
    } while (!(tmpDoc instanceof PoisonDoc));
  }

  private class PoisonDoc extends SolrDocument {
    // marker to finish queuing
  }

  private class MyCallbackHander extends StreamingResponseCallback {
    private BlockingQueue<SolrDocument> queue;
    private long currentPosition;
    private long numFound;

    public MyCallbackHander(BlockingQueue<SolrDocument> aQueue) {
      queue = aQueue;
    }

    @Override
    public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
      // called before start of streaming
      // probably use for some statistics
      currentPosition = aStart;
      numFound = aNumFound;
      if (numFound == 0) {
        queue.add(new PoisonDoc());
      }
    }

    @Override
    public void streamSolrDocument(SolrDocument aDoc) {
      currentPosition++;
      System.out.println("adding doc " + currentPosition + " of " + numFound);
      queue.add(aDoc);
      if (currentPosition == numFound) {
        queue.add(new PoisonDoc());
      }
    }
  }
}
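To illustrate the parallel processing mentioned above, here is a minimal sketch (not part of the original answer) of how the queue could be drained on a separate consumer thread while the streaming callback keeps filling it. It assumes it lives in the same StreamingTest class (so PoisonDoc and MyCallbackHander are visible), reuses the placeholder server URL and query, and additionally needs the java.util.concurrent ExecutorService/Executors/Future imports; the per-document work is just a placeholder:

  @Test
  public void streamingWithConsumerThread() throws Exception {
    HttpSolrServer server = new HttpSolrServer("http://your-server");
    SolrQuery query = new SolrQuery("your query");
    query.setRows(Integer.MAX_VALUE);

    final BlockingQueue<SolrDocument> queue = new LinkedBlockingQueue<SolrDocument>();
    ExecutorService executor = Executors.newSingleThreadExecutor();

    // Consumer thread: processes documents as soon as they arrive.
    Future<?> consumer = executor.submit(new Runnable() {
      @Override
      public void run() {
        try {
          SolrDocument doc;
          while (!((doc = queue.take()) instanceof PoisonDoc)) {
            // placeholder for the real per-document work
            System.out.println("processing " + doc.getFieldValue("id"));
          }
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });

    // Producer: the streaming callback pushes documents (and finally a PoisonDoc) into the queue.
    server.queryAndStreamResponse(query, new MyCallbackHander(queue));

    consumer.get(); // wait for the consumer to drain the queue
    executor.shutdown();
  }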

Answer 1 (score: 1):

You can improve performance by increasing FETCH_SIZE. Since you are fetching all of the results anyway, pagination doesn't make much sense unless you are concerned about memory or something similar. If 1000 results are enough to risk a memory overflow, I'd say your current performance already looks rather remarkable.

So I would try fetching everything at once, simplifying this to something like:

//WHOLE_BUNCHES is a constant representing a reasonable max number of docs we want to pull here.
//Integer.MAX_VALUE would probably invite an OutOfMemoryError, but that would be true of the
//implementation in the question anyway, since they were still being stored in the list at the end.
query.setRows(WHOLE_BUNCHES);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound(); //If you even still need this figure.
List<Article> ret = response.getBeans(Article.class);

If you do need to keep the pagination, though:

You are performing this first query:

QueryResponse response = solr.query(query);

and populating the number of results found from it, but you are not pulling any documents out of that response. Even if you keep the pagination here, you could at least eliminate one extra query.
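For illustration only, here is a minimal sketch of that suggestion (not code from the original answer), reusing the FETCH_SIZE constant and the Article bean from the question, so the documents of the first response are consumed instead of being thrown away:

query.setRows(FETCH_SIZE);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound();
List<Article> ret = new ArrayList<Article>(totalResults);

// Use the documents from the first response instead of discarding them.
ret.addAll(response.getBeans(Article.class));
int offset = ret.size();

while (offset < totalResults) {
    query.setStart(offset);
    response = solr.query(query);
    List<Article> current = response.getBeans(Article.class);
    offset += current.size();
    ret.addAll(current);
}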

This:

int left = totalResults - offset;
if(left < FETCH_SIZE) {
    query.setRows(left);
}

is unnecessary. setRows specifies the maximum number of rows to return, so asking for more than are available won't cause any problems.

Finally, nothing to do with the question, but I have to ask: if not an int, what kind of argument would you expect setStart to take?

Answer 2 (score: 0):

Use the logic below to fetch Solr data in batches, to optimize the performance of the Solr data-fetch query:

public List<Map<String, Object>> getData(int id, Set<String> fields) throws SolrServerException {
        final int SOLR_QUERY_MAX_ROWS = 3;
        long start = System.currentTimeMillis();
        SolrQuery query = new SolrQuery();
        String queryStr = "id:" + id;
        LOG.info(queryStr);
        query.setQuery(queryStr);
        query.setRows(SOLR_QUERY_MAX_ROWS);
        QueryResponse rsp = server.query(query, SolrRequest.METHOD.POST);
        List<Map<String, Object>> mapList = null;
        if (rsp != null) {
            long total = rsp.getResults().getNumFound();
            System.out.println("Total count found: " + total);
            // Solr query batch
            mapList = new ArrayList<Map<String, Object>>();
            if (total <= SOLR_QUERY_MAX_ROWS) {
                addAllData(mapList, rsp, fields);
            } else {
                // Consume the first response, then page through the remaining batches.
                addAllData(mapList, rsp, fields);
                int marker = SOLR_QUERY_MAX_ROWS;
                while (marker < total) {
                    query.setStart(marker);
                    rsp = server.query(query, SolrRequest.METHOD.POST);
                    if (rsp != null) {
                        addAllData(mapList, rsp, fields);
                    }
                    marker = marker + SOLR_QUERY_MAX_ROWS;
                }
            }
        }

        long end = System.currentTimeMillis();
        LOG.debug("SOLR Performance: getData: " + (end - start));

        return mapList;
    }

private void addAllData(List<Map<String, Object>> mapList, QueryResponse rsp, Set<String> fields) {
    for (SolrDocument sdoc : rsp.getResults()) {
        Map<String, Object> map = new HashMap<String, Object>();
        for (String field : fields) {
            map.put(field, sdoc.getFieldValue(field));
        }
        mapList.add(map);
    }
}
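A hypothetical usage sketch (not part of the original answer), assuming the enclosing class provides the server and LOG fields that getData relies on, that java.util.HashSet and java.util.Arrays are imported, and that the caller handles SolrServerException:

// Fetch a couple of stored fields for all documents matching id:42.
Set<String> fields = new HashSet<String>(Arrays.asList("id", "title"));
List<Map<String, Object>> rows = getData(42, fields);
for (Map<String, Object> row : rows) {
    System.out.println(row.get("id") + " -> " + row.get("title"));
}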