Lazily parsing elements in a huge XML file

Time: 2016-10-10 10:07:25

Tags: xml scala xml-parsing scales-xml

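We are working with OTDS files. In short, they are XML files containing a large amount of data, possibly more than 15 GB.

We chose the scalesXml library to process these files efficiently.

Let me give an example: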

<?xml version="1.0" encoding="UTF-8"?>
<Otds UpdateMode="Merge"
xmlns="http://otds-group.org/otds"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Version="1.9.1" xsi:schemaLocation="http://otds-group.org/otds ../xsd/otds.xsd">
 <Brands>
     ...
 </Brands>
 <Accommodations>
  <Accommodation Key="A">
   ...
   <SellingAccom>
    ...
    <PriceItems Key="1">...</PriceItems>
    ...
   </SellingAccom>
   ...
  </Accommodation>

...  <!-- A lot of <Accommodation> tags -->

  <Accommodation Key="Z">
  ...
  </Accommodation>
  <PriceItems Key="Global1"></PriceItems>   <!-- Collect all of these     -->
  <PriceItems Key="Global2"></PriceItems>
 </Accommodations>
</Otds>

We ran into the following problem. The XML contains a lot of heavy <Accommodation> tags. We want to extract all <PriceItems> elements that are direct children of the <Accommodations> tag.

I created a really simplified file:

<?xml version="1.0" encoding="UTF-8"?>
<Otds UpdateMode="Merge"
xmlns="http://otds-group.org/otds"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
Version="1.9.1" xsi:schemaLocation="http://otds-group.org/otds ../xsd/otds.xsd">
 <Brands>
  <Brand>EBIWA</Brand>
 </Brands>
 <Accommodations>
  <Accommodation Key="ATH432">
   <SellingAccom>
    <PriceItems Key="1"></PriceItems>
   </SellingAccom>
  </Accommodation>
  <Accommodation Key="ATH433">
   <SellingAccom>
    <PriceItems Key="2"></PriceItems>
   </SellingAccom>
  </Accommodation>
  <PriceItems Key="Global"></PriceItems>
 </Accommodations>
</Otds>

My current approach:

// Standard Scales XML imports
import scales.utils._
import ScalesUtils._
import scales.xml._
import ScalesXml._

val ns = Namespace("http://otds-group.org/otds")
val Otds = ns("Otds")
val Accommodations = ns("Accommodations")
val PriceItems = ns("PriceItems")
val Accommodation = ns("Accommodation")

// Path of the elements we want: /Otds/Accommodations/PriceItems
val priceItemsPath = List(Otds, Accommodations, PriceItems)

// inputstream is the stream of the OTDS file (not shown here)
val xml = pullXml(inputstream, optimisationStrategy = QNameElemTreeOptimisation)

val itr = iterate(priceItemsPath, xml)

// parseXml and the PriceItems case class are our own helpers (not shown here)
for {
  priceItems <- itr
} yield {
  val parsedJson = parseXml(priceItems)
  val result = parsedJson.children.head.extract[PriceItems]
  result
}

  1. It returns an Iterator[PriceItems] over all PriceItems, not only the expected last one (the direct child of <Accommodations>).
  2. How can we quickly extract the elements at the end of this huge file without parsing the whole file?
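
To make the second point concrete: an XML file has no index, so as far as I can tell any streaming parser still has to read the bytes that come before the trailing elements. What can be avoided is building trees for the heavy <Accommodation> subtrees. Below is a minimal sketch of that idea. It deliberately uses the JDK's StAX API (javax.xml.stream) instead of Scales, so it is a swapped-in technique and not the Scales way of doing it. The names DirectPriceItems and directPriceItemKeys are hypothetical, and collecting only the Key attribute is an assumption made to keep the example short.

```scala
import java.io.{ByteArrayInputStream, InputStream}
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}
import scala.collection.mutable.ListBuffer

// Hypothetical helper, not part of Scales or of the original code.
object DirectPriceItems {
  val OtdsNs = "http://otds-group.org/otds"

  // Path (innermost element first) of the elements we want to keep.
  private val wanted = List("PriceItems", "Accommodations", "Otds")

  /** Returns the Key attribute of every <PriceItems> that is a direct child of
    * /Otds/Accommodations. Nested <PriceItems> are read past but never collected. */
  def directPriceItemKeys(in: InputStream): List[String] = {
    val reader = XMLInputFactory.newInstance().createXMLStreamReader(in)
    val keys = ListBuffer.empty[String]
    var path = List.empty[String] // local names of the currently open elements
    try {
      while (reader.hasNext()) {
        reader.next() match {
          case XMLStreamConstants.START_ELEMENT =>
            path = reader.getLocalName() :: path
            // Compare the whole path, so <PriceItems> under <SellingAccom> does not match.
            if (path == wanted && reader.getNamespaceURI() == OtdsNs)
              Option(reader.getAttributeValue(null, "Key")).foreach(keys += _)
          case XMLStreamConstants.END_ELEMENT =>
            path = path.tail
          case _ => // text, comments and other events are ignored
        }
      }
    } finally reader.close()
    keys.toList
  }

  def main(args: Array[String]): Unit = {
    // Cut-down version of the simplified file from the question.
    val sample =
      """<Otds xmlns="http://otds-group.org/otds">
        |  <Accommodations>
        |    <Accommodation Key="ATH432">
        |      <SellingAccom><PriceItems Key="1"/></SellingAccom>
        |    </Accommodation>
        |    <PriceItems Key="Global"/>
        |  </Accommodations>
        |</Otds>""".stripMargin
    println(directPriceItemKeys(new ByteArrayInputStream(sample.getBytes("UTF-8"))))
  }
}
```

For the cut-down sample in main, only the Global key should be collected, because the nested <PriceItems Key="1"> sits at a different path. Memory use stays bounded by the nesting depth, but the run time is still proportional to the file size.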
