Question

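I am using ML 9.

I have 2.8 million XML documents in a MarkLogic database. I just want to get all of the unique element names.

Since the database is so large, what is the best and fastest way to get the unique element names?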

Answer 1

You could run a CORB job that selects all of the URIs in the database in the URIs module and returns a distinct list of each document's element names, using name() or local-name(), in the process module. Use PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask to write all of the output to a single file, then use POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask with the EXPORT-FILE-SORT=ascending|distinct option to sort and deduplicate that file. The result is a text file containing the distinct list of element names in the database.

Example job with all of the necessary options, except for XCC-CONNECTION-URI:

    # Inline module to select all URIs
    URIS-MODULE=INLINE-XQUERY|xdmp:estimate(fn:doc()), cts:uris("",(),cts:true-query())
    # Inline module to return a distinct list of element names in the document on a separate line
    PROCESS-MODULE=INLINE-XQUERY|declare variable $URI as xs:string external; string-join(fn:distinct-values(fn:doc($URI)//*/name()),"&#10;")
    # Write the results of each process module to a single file
    PROCESS-TASK=com.marklogic.developer.corb.ExportBatchToFileTask
    EXPORT-FILE-NAME=element-names.txt
    # After the batch processing is completed, sort and dedup the element names
    POST-BATCH-TASK=com.marklogic.developer.corb.PostBatchUpdateFileTask
    EXPORT-FILE-SORT=ascending|distinct
    THREAD-COUNT=10
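
The example job leaves out the connection string. As a minimal sketch, the missing option could be added to the same options file in the form below. It assumes a MarkLogic app server that accepts XCC connections; the user, password, host, port, and database name shown are placeholders, not values from the question.

```properties
# Placeholder values (assumptions): replace user, password, host, port, and database with your own
XCC-CONNECTION-URI=xcc://user:password@localhost:8000/Documents
```

With the options saved to a file (for example a hypothetical job.properties), the job would typically be started by running the com.marklogic.developer.corb.Manager class with the CoRB and XCC jars on the Java classpath and -DOPTIONS-FILE=job.properties set. The exact jar names depend on the versions installed.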
