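We are working with OTDS files. In short, they are XML files containing a large amount of data, potentially more than 15 GB.
We chose the scalesXml library to process these files efficiently.
Here is an example: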
<?xml version="1.0" encoding="UTF-8"?>
<Otds UpdateMode="Merge"
xmlns="http://otds-group.org/otds"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Version="1.9.1" xsi:schemaLocation="http://otds-group.org/otds ../xsd/otds.xsd">
<Brands>
...
</Brands>
<Accommodations>
<Accommodation Key="A">
...
<SellingAccom>
...
<PriceItems Key="1">...</PriceItems>
...
</SellingAccom>
...
</Accommodation>
... <!-- A lot of <Accommodation> tags -->
<Accommodation Key="Z">
...
</Accommodation>
<PriceItems Key="Global1"></PriceItems> <!-- Collect all of these -->
<PriceItems Key="Global2"></PriceItems>
</Accommodations>
</Otds>
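This is the problem we ran into. The XML contains many heavy <Accommodation> elements. We want to extract all <PriceItems> elements that are direct children of the <Accommodations> element.
I created a heavily simplified file:
<?xml version="1.0" encoding="UTF-8"?>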
<Otds UpdateMode="Merge"
xmlns="http://otds-group.org/otds"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Version="1.9.1" xsi:schemaLocation="http://otds-group.org/otds ../xsd/otds.xsd">
<Brands>
<Brand>EBIWA</Brand>
</Brands>
<Accommodations>
<Accommodation Key="ATH432">
<SellingAccom>
<PriceItems Key="1"></PriceItems>
</SellingAccom>
</Accommodation>
<Accommodation Key="ATH433">
<SellingAccom>
<PriceItems Key="2"></PriceItems>
</SellingAccom>
</Accommodation>
<PriceItems Key="Global"></PriceItems>
</Accommodations>
</Otds>
My current approach, tried against the simplified file above:
// QNames in the OTDS default namespace
val ns = Namespace("http://otds-group.org/otds")
val Otds = ns("Otds")
val Accommodations = ns("Accommodations")
val PriceItems = ns("PriceItems")
val Accommodation = ns("Accommodation")
// Path of the elements to extract: Otds / Accommodations / PriceItems
val priceItemsPath = List(Otds, Accommodations, PriceItems)
// Pull-parse the input stream rather than loading the whole document
val xml = pullXml(inputstream, optimisationStrategy = QNameElemTreeOptimisation)
val itr = iterate(priceItemsPath, xml)
for {
  priceItems <- itr
} yield {
  val parsedJson = parseXml(priceItems)
  val result = parsedJson.children.head.extract[PriceItems]
  result
}
This code returns an Iterator[PriceItems] over all PriceItems elements, not just the expected last one.
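To make the expected selection concrete, here is a minimal sketch written against plain StAX (javax.xml.stream from the JDK) rather than scalesXml. It is not our production code: the object name DirectChildPriceItems and the method globalPriceItemKeys are made up for illustration, and it only collects the Key attribute instead of building a PriceItems value. It keeps a stack of open element names and accepts a <PriceItems> only when its enclosing path is exactly Otds/Accommodations. Note that it still reads the stream sequentially from the start, so it shows which elements I want, not a way to skip ahead.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ListBuffer

// Hypothetical helper, for illustration only (not part of scalesXml).
object DirectChildPriceItems {
  val OtdsNs = "http://otds-group.org/otds"

  /** Keys of <PriceItems> elements whose parent is <Accommodations> directly under <Otds>. */
  def globalPriceItemKeys(in: InputStream): List[String] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    val keys = ListBuffer.empty[String]
    // Local names of the currently open elements, innermost first.
    var path = List.empty[String]
    try {
      while (reader.hasNext) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            val name = reader.getLocalName
            // Accept only when the enclosing path is exactly Otds/Accommodations,
            // which excludes PriceItems nested inside Accommodation/SellingAccom.
            if (name == "PriceItems" &&
                reader.getNamespaceURI == OtdsNs &&
                path == List("Accommodations", "Otds")) {
              keys += reader.getAttributeValue(null, "Key")
            }
            path = name :: path
          case XMLStreamConstants.END_ELEMENT =>
            path = path.tail
          case _ => // text, comments and other events are ignored
        }
      }
    } finally reader.close()
    keys.toList
  }

  def main(args: Array[String]): Unit = {
    // Cut-down version of the simplified file from this question.
    val sample =
      """<?xml version="1.0" encoding="UTF-8"?>
        |<Otds xmlns="http://otds-group.org/otds">
        |  <Accommodations>
        |    <Accommodation Key="ATH432">
        |      <SellingAccom><PriceItems Key="1"></PriceItems></SellingAccom>
        |    </Accommodation>
        |    <PriceItems Key="Global"></PriceItems>
        |  </Accommodations>
        |</Otds>""".stripMargin
    val keys = globalPriceItemKeys(new ByteArrayInputStream(sample.getBytes("UTF-8")))
    println(keys) // prints List(Global)
  }
}
```

Only the element with Key="Global" is selected here; the nested Key="1" is skipped because its parent is <SellingAccom>. That is the result I would like to get from the scalesXml iterate call as well.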
How can I quickly extract the elements at the end of this huge file without parsing the whole file?