I want Solr to always retrieve all documents found by a search (I know Solr isn't built for that, but anyway). Right now I'm doing it with this code:
...
QueryResponse response = solr.query(query);
int offset = 0;
int totalResults = (int) response.getResults().getNumFound();
List<Article> ret = new ArrayList<Article>(totalResults);
query.setRows(FETCH_SIZE);
while (offset < totalResults) {
    //requires an int? wtf?
    query.setStart(offset);
    int left = totalResults - offset;
    if (left < FETCH_SIZE) {
        query.setRows(left);
    }
    response = solr.query(query);
    List<Article> current = response.getBeans(Article.class);
    offset += current.size();
    ret.addAll(current);
}
...
This works, but it's quite slow if a query has more than 1000 hits (I've read about this here; it's caused by Solr, because setting the start each time takes, for some reason, some time). What is a better (and faster) way to do this?
Answer 0 (score: 8)
To improve on the suggested answer you can use a streaming response. This was added especially for the case that one fetches all results. As you can see in Solr's Jira, someone there wanted to do the same thing as you. This has been implemented for Solr 4.
This is also described in Solrj's javadoc.
Solr will package up the response and create a whole XML/JSON document before it starts sending the response. Your client is then required to unpack all that and offer it to you as a list. By using streaming and parallel processing, which you can do when using this kind of queued approach, performance should improve further.
Yes, you will lose the automatic bean mapping, but since performance is a factor here, I think that is acceptable.
Here is a unit test as a sample:
public class StreamingTest {

    @Test
    public void streaming() throws SolrServerException, IOException, InterruptedException {
        HttpSolrServer server = new HttpSolrServer("http://your-server");
        SolrQuery tmpQuery = new SolrQuery("your query");
        tmpQuery.setRows(Integer.MAX_VALUE);
        final BlockingQueue<SolrDocument> tmpQueue = new LinkedBlockingQueue<SolrDocument>();
        server.queryAndStreamResponse(tmpQuery, new MyCallbackHandler(tmpQueue));
        SolrDocument tmpDoc;
        do {
            tmpDoc = tmpQueue.take();
        } while (!(tmpDoc instanceof PoisonDoc));
    }

    private class PoisonDoc extends SolrDocument {
        // marker to finish queuing
    }

    private class MyCallbackHandler extends StreamingResponseCallback {
        private BlockingQueue<SolrDocument> queue;
        private long currentPosition;
        private long numFound;

        public MyCallbackHandler(BlockingQueue<SolrDocument> aQueue) {
            queue = aQueue;
        }

        @Override
        public void streamDocListInfo(long aNumFound, long aStart, Float aMaxScore) {
            // called before streaming starts
            // probably useful for some statistics
            currentPosition = aStart;
            numFound = aNumFound;
            if (numFound == 0) {
                queue.add(new PoisonDoc());
            }
        }

        @Override
        public void streamSolrDocument(SolrDocument aDoc) {
            currentPosition++;
            System.out.println("adding doc " + currentPosition + " of " + numFound);
            queue.add(aDoc);
            if (currentPosition == numFound) {
                queue.add(new PoisonDoc());
            }
        }
    }
}
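The poison-pill handoff in that test can be exercised without a live Solr server. Below is a minimal self-contained sketch of the same queue pattern, assuming a hypothetical `Doc` class standing in for `SolrDocument` and a producer thread standing in for the streaming callback:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PoisonPillDemo {

    // Hypothetical stand-in for SolrDocument; a subclass marks end-of-stream.
    static class Doc {
        final int id;
        Doc(int id) { this.id = id; }
    }

    static class PoisonDoc extends Doc {
        PoisonDoc() { super(-1); }
    }

    public static List<Integer> consumeAll(int numFound) throws InterruptedException {
        final BlockingQueue<Doc> queue = new LinkedBlockingQueue<>();

        // Producer plays the role of the streaming callback: it enqueues
        // documents as they "arrive", then the poison pill at the end.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < numFound; i++) {
                queue.add(new Doc(i));
            }
            queue.add(new PoisonDoc());
        });
        producer.start();

        // Consumer drains the queue until the poison pill shows up,
        // exactly like the do/while loop in the unit test above.
        List<Integer> ids = new ArrayList<>();
        Doc doc;
        while (!((doc = queue.take()) instanceof PoisonDoc)) {
            ids.add(doc.id);
        }
        producer.join();
        return ids;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(consumeAll(5));
    }
}
```

In the real version the consumer can process each document while Solr is still streaming the rest, which is where the parallelism comes from.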
Answer 1 (score: 1)
You might improve performance by increasing FETCH_SIZE. Since you are getting all the results anyway, pagination doesn't make sense unless you are concerned about memory or some such. If 1000 results are liable to cause a memory overflow, though, I'd say your current performance seems quite outstanding.
So I would try getting everything at once, simplifying this to something like:
//WHOLE_BUNCHES is a constant representing a reasonable max number of docs we want to pull here.
//Integer.MAX_VALUE would probably invite an OutOfMemoryError, but that would be true of the
//implementation in the question anyway, since they were still being stored in the list at the end.
query.setRows(WHOLE_BUNCHES);
QueryResponse response = solr.query(query);
int totalResults = (int) response.getResults().getNumFound(); //If you even still need this figure.
List<Article> ret = response.getBeans(Article.class);
If you need to keep the pagination, though:
You are performing this first query:
QueryResponse response = solr.query(query);
and are populating the number of results found, but you are not pulling any results from the response. Even if you keep pagination here, you could at least eliminate one extra query.
This:
int left = totalResults - offset;
if (left < FETCH_SIZE) {
    query.setRows(left);
}
is unnecessary. setRows specifies the maximum number of rows to return, so asking for more than are available won't cause any problems.
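The point about not discarding the first response can be sketched with a self-contained paging loop. This is plain Java over an in-memory list; `fetchPage` is a hypothetical stand-in for `solr.query`:

```java
import java.util.ArrayList;
import java.util.List;

public class PagingDemo {

    static final int FETCH_SIZE = 4;
    static final List<String> BACKEND = new ArrayList<>();
    static {
        for (int i = 0; i < 10; i++) BACKEND.add("doc" + i);
    }

    // Hypothetical stand-in for solr.query(query): returns one page of results.
    static List<String> fetchPage(int start, int rows) {
        int end = Math.min(start + rows, BACKEND.size());
        return start >= end ? new ArrayList<>() : new ArrayList<>(BACKEND.subList(start, end));
    }

    public static List<String> fetchAll() {
        // The first query already returns a page; consume it instead of discarding it.
        List<String> firstPage = fetchPage(0, FETCH_SIZE);
        int totalResults = BACKEND.size(); // plays the role of numFound
        List<String> ret = new ArrayList<>(totalResults);
        ret.addAll(firstPage);
        int offset = ret.size();
        while (offset < totalResults) {
            // rows stays at FETCH_SIZE throughout: it is a maximum, so the
            // final partial page needs no special-casing.
            List<String> current = fetchPage(offset, FETCH_SIZE);
            offset += current.size();
            ret.addAll(current);
        }
        return ret;
    }

    public static void main(String[] args) {
        System.out.println(fetchAll().size());
    }
}
```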
Finally, apropos of nothing, but I have to ask: what kind of argument would you expect setStart to take, if not an int?
Answer 2 (score: 0)
Fetch Solr data in batches with the following logic, to optimize the performance of the Solr data-fetch query:
public List<Map<String, Object>> getData(int id, Set<String> fields) {
    final int SOLR_QUERY_MAX_ROWS = 3;
    long start = System.currentTimeMillis();
    SolrQuery query = new SolrQuery();
    String queryStr = "id:" + id;
    LOG.info(queryStr);
    query.setQuery(queryStr);
    query.setRows(SOLR_QUERY_MAX_ROWS);
    QueryResponse rsp = server.query(query, SolrRequest.METHOD.POST);
    List<Map<String, Object>> mapList = null;
    if (rsp != null) {
        long total = rsp.getResults().getNumFound();
        System.out.println("Total count found: " + total);
        // Solr query batch: consume the first response, then page with setStart
        // until every row (including the final partial batch) is collected.
        mapList = new ArrayList<Map<String, Object>>();
        int fetched = 0;
        while (rsp != null && fetched < total) {
            if (rsp.getResults().isEmpty()) {
                break; // defensive: avoid looping forever on an empty page
            }
            addAllData(mapList, rsp, fields);
            fetched += rsp.getResults().size();
            if (fetched < total) {
                query.setStart(fetched);
                rsp = server.query(query, SolrRequest.METHOD.POST);
            }
        }
    }
    long end = System.currentTimeMillis();
    LOG.debug("SOLR Performance: getData: " + (end - start));
    return mapList;
}

private void addAllData(List<Map<String, Object>> mapList, QueryResponse rsp, Set<String> fields) {
    for (SolrDocument sdoc : rsp.getResults()) {
        Map<String, Object> map = new HashMap<String, Object>();
        for (String field : fields) {
            map.put(field, sdoc.getFieldValue(field));
        }
        mapList.add(map);
    }
}
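Batching loops like this are easy to get wrong at the boundaries (the last partial page is the usual casualty), so it helps to check the arithmetic against an in-memory source first. A self-contained sketch in plain Java, with a hypothetical `queryPage` in place of `server.query`:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchFetchDemo {

    static final int MAX_ROWS = 3;

    // Hypothetical stand-in for server.query(...): pages over a fixed dataset.
    static List<Integer> queryPage(List<Integer> source, int start, int rows) {
        int end = Math.min(start + rows, source.size());
        return start >= end ? new ArrayList<>() : new ArrayList<>(source.subList(start, end));
    }

    public static List<Integer> fetchAll(List<Integer> source) {
        long total = source.size(); // plays the role of numFound
        List<Integer> out = new ArrayList<>();
        List<Integer> page = queryPage(source, 0, MAX_ROWS); // the "first response"
        int fetched = 0;
        while (!page.isEmpty() && fetched < total) {
            out.addAll(page);
            fetched += page.size();
            if (fetched < total) {
                page = queryPage(source, fetched, MAX_ROWS);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> src = new ArrayList<>();
        for (int i = 0; i < 7; i++) src.add(i); // 7 rows: last batch is partial
        System.out.println(fetchAll(src).size());
    }
}
```

With 7 rows and a batch size of 3, the loop must run three times (3 + 3 + 1) rather than stopping after the second full batch.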