My question
How can I export a large number (up to ~30,000,000) of Solr documents to CSV in a sharded setup using distributed queries (SolrJ)?
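My strategy is to query in batches (n days at a time), but I currently hit a limit of about 200,000 documents per batch.
I would like to get 1,000,000 per batch.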
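A minimal sketch of such day-based batching, using only the JDK. The class name BatchWindows and the method rangeQueries are hypothetical helpers for illustration; the field name firstTimestamp_dis and the date format are taken from the queries in this post. Note that both range bounds are inclusive, as in the original queries, so a document stamped exactly at midnight on a boundary could appear in two adjacent batches.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a date range into n-day windows and
// builds one Solr range query per window.
final class BatchWindows {

    // Formats a day as a Solr timestamp at midnight UTC,
    // e.g. 2013-03-01T00:00:00Z.
    private static String solrDate(LocalDate day) {
        return day + "T00:00:00Z";
    }

    // Returns one range query per window of at most `days` days
    // covering start..end. Bounds are inclusive on both sides.
    static List<String> rangeQueries(String field, LocalDate start,
                                     LocalDate end, int days) {
        if (days <= 0) {
            throw new IllegalArgumentException("days must be positive");
        }
        List<String> queries = new ArrayList<>();
        LocalDate from = start;
        while (from.isBefore(end)) {
            LocalDate to = from.plusDays(days);
            if (to.isAfter(end)) {
                to = end; // clamp the last window to the overall end
            }
            queries.add(field + ":[" + solrDate(from)
                    + " TO " + solrDate(to) + "]");
            from = to;
        }
        return queries;
    }
}
```

Calling BatchWindows.rangeQueries("firstTimestamp_dis", LocalDate.of(2013, 3, 1), LocalDate.of(2013, 4, 1), 10) yields four queries for March 2013; each string would be passed as the q parameter of one export request.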
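My setup is a Solr index consisting of multiple shards. Each shard holds one month of documents, and documents are added to a shard according to a timestamp field. I query with the shards parameter set, which usually works well.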
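Because each shard covers one month, the value of the shards parameter for a given date range can be derived from the months involved. The sketch below assumes the core naming pattern in.part.yyyyMM and the base address localhost:8080/index seen in the URLs in this post; the class ShardList and the method shardsParam are hypothetical names.

```java
import java.time.YearMonth;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the comma-separated value of the Solr
// "shards" parameter for a range of monthly shards.
final class ShardList {

    // Assumed naming convention: one core per month, suffixed yyyyMM.
    private static final DateTimeFormatter YYYYMM =
            DateTimeFormatter.ofPattern("yyyyMM");

    // baseUrl has no scheme, e.g. "localhost:8080/index";
    // corePrefix is e.g. "in.part.". Both months are inclusive.
    static String shardsParam(String baseUrl, String corePrefix,
                              YearMonth first, YearMonth last) {
        List<String> shards = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            shards.add(baseUrl + "/" + corePrefix + m.format(YYYYMM));
        }
        return String.join(",", shards);
    }
}
```

For a batch that lies entirely within March 2013, passing YearMonth.of(2013, 3) as both first and last gives the single shard localhost:8080/index/in.part.201303.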
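Now I want to export the documents, or some of their fields, to a CSV file. But when there are many documents, my request fails. I have stripped down my URLs, but this is the request that fails: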
// query I) query march 2013 sharded -> does not work
http://localhost:8080/index/in.part.201301/select/?rows=1000000&
shards=localhost:8080/index/in.part.201303&
wt=csv&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2
Exception on the index server:
14:18:55,726 SEVERE [SolrCore] java.lang.NullPointerException
at java.io.StringReader.<init>(StringReader.java:33)
at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:203)
at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
at org.apache.solr.search.QParser.getQuery(QParser.java:142)
at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:101)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
at com.company.InitializerDispatchFilter.doFilter(InitializerDispatchFilter.java:93)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:829)
at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:598)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
at java.lang.Thread.run(Thread.java:662)
19:16:44,638 INFO [SolrCore] [in1.part.201303] webapp=/index path=/select params={} status=500 QTime=2
19:16:44,647 SEVERE [SolrCore] org.apache.solr.common.SolrException: Internal Server Error
Internal Server Error
request: http://localhost:8080/ipc-index/in1.part.201303/select
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432)
at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:421)
at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:393)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
at java.util.concurrent.FutureTask.run(FutureTask.java:138)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
at java.lang.Thread.run(Thread.java:662)
The non-sharded query works:
// query II) query march 2013 non sharded --> works
http://localhost:8080/index/in.part.201303/select/?rows=1000000&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2&
wt=csv
and
// query III) sharded query with rows=200000 --> works as well, (rows=210000 does fail like query I)
http://localhost:8080/index/in.part.201301/select/?rows=200000&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2&
wt=csv&
shards=localhost:8080/index/in.part.201303
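Memory
I don't think the problem is memory-related: my index server VM has 1 GB of memory. If I reduce it to 256 MB and run query III), it executes very slowly and aborts with an out-of-memory error. If I increase the memory, query I) still fails.
Also, if I add more fields to the field list of query III), it still always succeeds.
On my client (SolrJ) I send the queries using Method.POST.
Can anybody help?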