My question

How can I export (up to ~30'000'000) Solr documents to CSV in a sharded setup using distributed queries (SolrJ)?

My strategy is to query in batches (of n days), but I currently hit a limit of about 200,000 documents per batch.

I would like to get 1'000'000 documents per batch.
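To make the batching concrete, here is a minimal sketch of how a time span could be split into n-day range queries. It only builds the query strings and does not call SolrJ; the class name `BatchWindows` and the method `rangeQueries` are hypothetical names chosen for illustration, and the field name is the one used in the queries in this post.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

public class BatchWindows {

    /**
     * Builds one Solr range query per window of at most `days` days
     * between `start` and `end`. Hypothetical helper, not part of SolrJ.
     */
    public static List<String> rangeQueries(String field, Instant start, Instant end, int days) {
        if (days <= 0) {
            throw new IllegalArgumentException("days must be positive");
        }
        List<String> queries = new ArrayList<>();
        Instant from = start;
        while (from.isBefore(end)) {
            Instant to = from.plus(days, ChronoUnit.DAYS);
            if (to.isAfter(end)) {
                to = end; // clamp the last window to the end of the span
            }
            // Square brackets are inclusive at both ends, as in the queries in this post.
            queries.add(field + ":[" + from + " TO " + to + "]");
            from = to;
        }
        return queries;
    }

    public static void main(String[] args) {
        List<String> queries = rangeQueries(
                "firstTimestamp_dis",
                Instant.parse("2013-03-01T00:00:00Z"),
                Instant.parse("2013-04-01T00:00:00Z"),
                10);
        for (String q : queries) {
            System.out.println(q);
        }
    }
}
```

Because both ends of each range are inclusive, a document stamped exactly on a window boundary would show up in two consecutive batches, so the exported ids may need de-duplication.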

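My setup is a Solr index with multiple shards. Each shard holds one month of documents. Documents are added to a shard according to a timestamp field. I query with the shards parameter set, and this usually works well.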

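Now I want to export the documents, or just some of their fields, to a CSV file. But with this many documents my request fails. I have stripped down my URL, but the request still fails: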

// query I) query march 2013 sharded -> does not work

http://localhost:8080/index/in.part.201301/select/?rows=1000000&
shards=localhost:8080/index/in.part.201303&
wt=csv&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2

Exception on the index server:

14:18:55,726 SEVERE [SolrCore] java.lang.NullPointerException
    at java.io.StringReader.<init>(StringReader.java:33)
    at org.apache.lucene.queryParser.QueryParser.parse(QueryParser.java:203)
    at org.apache.solr.search.LuceneQParser.parse(LuceneQParserPlugin.java:80)
    at org.apache.solr.search.QParser.getQuery(QParser.java:142)
    at org.apache.solr.handler.component.QueryComponent.prepare(QueryComponent.java:101)
    at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:173)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1372)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
    at com.company.InitializerDispatchFilter.doFilter(InitializerDispatchFilter.java:93)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
    at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:829)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:598)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:662)

19:16:44,638 INFO  [SolrCore] [in1.part.201303] webapp=/index path=/select params={} status=500 QTime=2 
19:16:44,647 SEVERE [SolrCore] org.apache.solr.common.SolrException: Internal Server Error

Internal Server Error

request: http://localhost:8080/ipc-index/in1.part.201303/select
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:432)
    at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:246)
    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:421)
    at org.apache.solr.handler.component.HttpCommComponent$1.call(SearchHandler.java:393)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:439)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
    at java.lang.Thread.run(Thread.java:662)

The non-sharded query works:

// query II) query march 2013 non sharded --> works
http://localhost:8080/index/in.part.201303/select/?rows=1000000&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2&
wt=csv

// query III) sharded query with rows=200000 --> works as well (rows=210000 fails like query I)
http://localhost:8080/index/in.part.201301/select/?rows=200000&
fl=id&
q=firstTimestamp_dis:[2013-03-01T00:00:00Z+TO+2013-04-01T00:00:00Z]&
version=2.2&
wt=csv&
shards=localhost:8080/index/in.part.201303

Memory
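I don't think the problem is memory-related: my index server VM has 1 GB of memory. If I reduce that to 256 MB and run query III), it runs very slowly and aborts with an out-of-memory error. If I increase the memory, query I) still fails.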

Moreover, if I add more fields to the field list in query III), it still always succeeds.

On my client (SolrJ), I send the query using Method.POST.

Can anyone help?