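I am developing an application that needs to fetch up to 1 million documents within 2 hours.
We are using the Java Client API with a structured query to perform the search. However, the query is still slow.
The code is as follows: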
def fetchPostMessages(dbParam: DbParam): Page = {
  val queryManager = dbClient.newQueryManager()
  val sqb: StructuredQueryBuilder = queryManager.newStructuredQueryBuilder()
  log.info("Fetching post messages from database for params: {}", dbParam)
  // With a time parameter, filter on status, data category and creation date;
  // without one, filter on status alone.
  val modifiedQueryDef = dbParam.param.map { param =>
    sqb.and(
      sqb.word(sqb.jsonProperty(status), toBeReported),
      sqb.word(sqb.jsonProperty(dataCategory), "dataCategory1"),
      sqb.range(sqb.jsonProperty(creationDate), marklogicDateFormat.name, Operator.LE, DateUtil.printFpmlDateTime(param.messagesTime)))
  }.getOrElse(sqb.and(sqb.word(sqb.jsonProperty(status.name), toBeReported.name)))
  modifiedQueryDef.setCollections(XmlConstants.ItracMessageTypes.OUTPUT_MESSAGE.name)
  modifiedQueryDef.setOptionsName(sortOption)
  search(modifiedQueryDef, dbParam.pageNum, dbParam.batchSize)
}
private def search(queryDef: QueryDefinition, startIndex: Int, batchSize: Int): Page = {
  val dataList: ListBuffer[Document] = new ListBuffer()
  val jsonDocManager = dbClient.newJSONDocumentManager()
  jsonDocManager.setMetadataCategories(Metadata.ALL)
  // Never request more per page than the configured maximum page length.
  jsonDocManager.setPageLength(
    if (batchSize < pageLength) batchSize
    else pageLength)
  val documentPage = jsonDocManager.search(queryDef, startIndex)
  dataList ++= extractContent(documentPage)
  val totalSize = documentPage.getTotalSize
  log.info(s"Total documents to be reported : ${totalSize}")
  var pageSize = documentPage.getPageSize
  // Keep fetching pages until the batch is full or the results run out.
  while (pageSize < batchSize && pageSize <= totalSize) {
    if (batchSize - pageSize < pageSize)
      jsonDocManager.setPageLength(batchSize - pageSize)
    val newDocPage = jsonDocManager.search(queryDef, pageSize + 1)
    dataList ++= extractContent(newDocPage)
    pageSize = pageSize + newDocPage.getPageSize
  }
  log.info("Total messages fetched are : {}", dataList.size)
  Page(startIndex, totalSize - batchSize, dataList.to[collection.immutable.Seq])
}
The sort options are:
<search:options xmlns:search="http://marklogic.com/appservices/search">
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>subdomLvl1</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>trdId</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>validStartDate</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>ver</search:json-property>
  </search:sort-order>
  <search:sort-order type="xs:string" direction="ascending">
    <search:json-property>reportStatus</search:json-property>
  </search:sort-order>
</search:options>
The database is indexed as follows:
Element range indexes on status, dataCategory and creationDate, and on every property used in the sort options.
Answer 0 (score: 1)
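If the process doesn't need the document metadata, consider configuring the manager with jsonDocManager.clearMetadataCategories() instead of jsonDocManager.setMetadataCategories(Metadata.ALL). That reduces the work on both the server and the client, and reduces the amount of data transferred.
The loop can be simplified by testing newDocPage.hasNextPage() - see: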
http://docs.marklogic.com/guide/java/bulk#id_21619
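As a rough illustration of a loop driven by hasNextPage(), here is a minimal sketch in Scala. The names PagedFetch, Chunk, collectUpTo and searchPage are hypothetical helpers invented for this example, not part of the question's code or of the MarkLogic API. It assumes the MarkLogic Java Client API is on the classpath and that each document is wanted as a JSON string.

```scala
import com.marklogic.client.document.JSONDocumentManager
import com.marklogic.client.io.StringHandle
import com.marklogic.client.query.QueryDefinition

// Hypothetical helper object, for illustration only.
object PagedFetch {
  /** One page of results plus the server's answer to "is there another page?". */
  final case class Chunk[A](items: Seq[A], hasNextPage: Boolean)

  /** Collects at most `limit` items, starting at the 1-based position `start`.
    * `fetch(position)` must return the page that begins at `position`. */
  def collectUpTo[A](start: Long, limit: Int)(fetch: Long => Chunk[A]): Vector[A] = {
    val out = Vector.newBuilder[A]
    var taken = 0
    var next = start
    var more = true
    while (more && taken < limit) {
      val chunk = fetch(next)
      val items = chunk.items.take(limit - taken)
      out ++= items
      taken += items.size
      next += chunk.items.size
      // Stop on an empty page as well, so a misbehaving source cannot loop forever.
      more = chunk.hasNextPage && chunk.items.nonEmpty
    }
    out.result()
  }

  /** Adapter: reads one page of search results from MarkLogic as JSON strings. */
  def searchPage(mgr: JSONDocumentManager, query: QueryDefinition)(start: Long): Chunk[String] = {
    val page = mgr.search(query, start)
    try {
      val items = Vector.newBuilder[String]
      while (page.hasNext()) items += page.next().getContent(new StringHandle()).get()
      Chunk(items.result(), page.hasNextPage())
    } finally page.close()
  }
}
```

With the names from the question, a call might look like jsonDocManager.clearMetadataCategories() followed by PagedFetch.collectUpTo(startIndex, batchSize)(PagedFetch.searchPage(jsonDocManager, queryDef)).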
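Can the client stream documents to the consuming process as they arrive, instead of accumulating all one million documents in a single list? That would certainly improve throughput.
You might also consider using the Data Movement SDK to read the documents in multiple threads: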
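One way to stream rather than accumulate is to expose the results as a lazy Iterator that only asks for the next page once the current one has been consumed. The sketch below is an assumption about how that could be wired up, not code from the question: DocumentStream, paged and contents are hypothetical names, and the MarkLogic Java Client API is assumed to be on the classpath.

```scala
import com.marklogic.client.document.JSONDocumentManager
import com.marklogic.client.io.StringHandle
import com.marklogic.client.query.QueryDefinition

// Hypothetical helper object, for illustration only.
object DocumentStream {
  /** Lazily walks every page. `fetch(start)` returns the items of the page that
    * begins at the 1-based position `start`, plus whether another page follows. */
  def paged[A](fetch: Long => (Vector[A], Boolean)): Iterator[A] = new Iterator[A] {
    private var buffer: Iterator[A] = Iterator.empty
    private var nextStart = 1L
    private var more = true

    // Pull the next page only when the current one is exhausted.
    private def fill(): Unit =
      while (!buffer.hasNext && more) {
        val (items, hasNextPage) = fetch(nextStart)
        buffer = items.iterator
        nextStart += items.size
        more = hasNextPage && items.nonEmpty
      }

    def hasNext: Boolean = { fill(); buffer.hasNext }
    def next(): A = { fill(); buffer.next() }
  }

  /** Adapter: streams the JSON content of every document matching `query`. */
  def contents(mgr: JSONDocumentManager, query: QueryDefinition): Iterator[String] =
    paged { start =>
      val page = mgr.search(query, start)
      try {
        val items = Vector.newBuilder[String]
        while (page.hasNext()) items += page.next().getContent(new StringHandle()).get()
        (items.result(), page.hasNextPage())
      } finally page.close()
    }
}
```

The consumer can then write DocumentStream.contents(jsonDocManager, queryDef).foreach(report), where report is whatever the downstream process does with one document. Only one page is held in memory at a time.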
http://docs.marklogic.com/guide/java/data-movement#id_60613
http://docs.marklogic.com/javadoc/client/com/marklogic/client/datamovement/QueryBatcher.html
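A minimal sketch of what a QueryBatcher job might look like from Scala follows. ParallelExport and exportAll are hypothetical names, and the batch size of 500 is an arbitrary starting point rather than a recommendation. The sketch assumes Scala 2.12 or later (for lambda-to-listener conversion) and a version of the Java Client API that includes the Data Movement SDK.

```scala
import com.marklogic.client.DatabaseClient
import com.marklogic.client.datamovement.{ExportListener, QueryBatcher}
import com.marklogic.client.document.DocumentRecord
import com.marklogic.client.query.StructuredQueryDefinition

// Hypothetical helper object, for illustration only.
object ParallelExport {
  /** Runs `query` as a Data Movement job and hands every matching document to
    * `consume`. The callback runs on several threads, so it must be thread-safe. */
  def exportAll(client: DatabaseClient, query: StructuredQueryDefinition, threads: Int)
               (consume: DocumentRecord => Unit): Unit = {
    val dmm = client.newDataMovementManager()
    val batcher: QueryBatcher = dmm.newQueryBatcher(query)
      .withBatchSize(500)        // URIs per batch; an assumed value, tune by measurement
      .withThreadCount(threads)
      // ExportListener reads the documents for each batch of matching URIs.
      .onUrisReady(new ExportListener().onDocumentReady(doc => consume(doc)))
      .onQueryFailure(failure => failure.printStackTrace())
    dmm.startJob(batcher)
    batcher.awaitCompletion()    // blocks until every batch has been processed
    dmm.stopJob(batcher)
  }
}
```

Because batches are processed concurrently, I would not rely on the documents arriving in the order given by the sort options. If the consumer needs that order, the paged search is probably the safer choice.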
Hope that helps.