Parsing an HTML page with sub-headings using XQuery

Time: 2014-02-07 14:39:44

Tags: parsing xml-parsing xquery

I have an HTML page with the following structure:

<div id="content">
    <h2><span class="heading">Section A</span></h2>
    <p>Content of the section</p>
    <p>More content in the same section</p>
    <div>We can also have divs</div>
    <ul><li>And</li><li>Lists</li><li>Too</li></ul>
    <h3><span class="heading">Sub-section heading</span></h3>
    <p>The content here can be a mixture of divs, ps, lists, etc too</p>
    <h2><span class="heading">Section B</span></h2>
    <p>This is section B's content</p>
    and so on
</div>

I want to create the following XML structure:

<sections>
    <section>
        <heading>Section A</heading>
        <content>
            <p>Content of the section</p>
            <p>More content in the same section</p>
            <div>We can also have divs</div>
            <ul><li>And</li><li>Lists</li><li>Too</li></ul>
        </content>
        <sub-sections>
            <section>
                <heading>Section B</heading>
                <content>
                    <p>This is section B's content</p>
                </content>
            </section>
        </sub-sections>
    </section>
</sections>

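The part I am struggling with is creating the <sub-sections> tag. This is what I have so far, but Section B ends up inside the <content> node of Section A. I also get a <section> node for Section B, but it has no content.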

let $content := //div[@id="content"]
let $headings := $content/(h2|h3|h4|h5|h6)[span[@class="heading"]]
return
  <sections>
  {
    for $heading in $headings
    return
      <section>
        <heading>{$heading/span/text()}</heading>
        <content>
        {
          for $paragraph in $heading/following-sibling::*[preceding-sibling::h2[1] = $heading]
          return
            $paragraph
        }
        </content>
      </section>
  }
  </sections>

Thanks in advance for any help or pointers.

2 answers:

Answer 0 (score: 2)

I would first isolate the data belonging to one section into a variable and then work on that:

let $content := //div[@id="content"]
return
  <sections>
  {
    for $heading in $content//h2[span[@class='heading']]
    let $nextHeading := $heading/following-sibling::h2
    let $sectionContent := $heading/following-sibling::* except ($nextHeading, $nextHeading/following-sibling::*)
    return
      <section>
        {$sectionContent}
      </section>
  }
  </sections>

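Here I have only handled the sections. You can then handle the sub-sections by doing something similar again on the $sectionContent variable, except that now you need a slightly awkward selection to pick out the first part of the section (and do something similar for the remaining parts):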

$sectionContent except ($sectionContent[self::h3], $sectionContent[self::h3]/following-sibling::*)
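The two steps above can be combined into one query. The following is only a sketch, under the assumptions that the page is well-formed XML available as the context item, that its elements are in no namespace, and that h2 and h3 are direct children of the content div. The variable names ($next-h2, $section, $h3s, $first-h3, $next-h3) are made up for illustration. It relies only on the `except` and `intersect` operators, so it should not need XQuery 3.0.

```xquery
let $content := //div[@id = "content"]
return
  <sections>{
    for $h2 in $content/h2[span[@class = "heading"]]
    (: everything after this h2, up to but excluding the next h2 :)
    let $next-h2 := $h2/following-sibling::h2[1]
    let $section := $h2/following-sibling::*
                    except ($next-h2, $next-h2/following-sibling::*)
    (: the h3 headings that belong to this section :)
    let $h3s := $section[self::h3]
    let $first-h3 := $h3s[1]
    return
      <section>
        <heading>{ normalize-space($h2) }</heading>
        <content>{
          (: material before the first h3; the whole section if there is none :)
          $section except ($first-h3, $first-h3/following-sibling::*)
        }</content>
        <sub-sections>{
          for $h3 in $h3s
          let $next-h3 := $h3/following-sibling::h3[1]
          return
            <section>
              <heading>{ normalize-space($h3) }</heading>
              <content>{
                (: siblings after this h3, limited to the current section,
                   stopping before the next h3 :)
                ($section intersect $h3/following-sibling::*)
                except ($next-h3, $next-h3/following-sibling::*)
              }</content>
            </section>
        }</sub-sections>
      </section>
  }</sections>
```

The `intersect` with $section matters because `following-sibling::h3[1]` may point into a later section. Bare text such as "and so on" is skipped, since `*` selects elements only. On the sample page this should yield Section A with four content elements and one sub-section, and Section B with one paragraph and an empty <sub-sections> element.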

Answer 1 (score: 2)

In XQuery 3.0, you can use window clauses to group your sections and sub-sections quite elegantly:

<sections>{
  for tumbling window $section in //div[@id = 'content']/*
      start $h2 when $h2 instance of element(h2)
  return <section>{
    <heading>{$h2//text()}</heading>,
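    (: items after the h2 that are not, and do not follow, an h3 of this section :)
    tail($section)[empty((., preceding-sibling::*) intersect $section[self::h3])],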
    <sub-sections>{
      for tumbling window $sub-section in $section
          start $h3 when $h3 instance of element(h3)
      return <section>{
        <heading>{$h3//text()}</heading>,
        tail($sub-section)
      }</section>
    }</sub-sections>
  }</section>
}</sections>
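The query above emits section bodies directly, without the <content> wrappers shown in the question. A variant that adds them might look like the sketch below. It assumes an XQuery 3.0 processor with window clause support, the page as the context item, and elements in no namespace. $body, $first-h3, $intro and $sub are illustrative names, not part of the original answer.

```xquery
xquery version "3.0";

<sections>{
  for tumbling window $section in //div[@id = 'content']/*
      start $h2 when $h2 instance of element(h2)
  (: the window starts with the h2 itself; drop it to get the body :)
  let $body := tail($section)
  (: first h3 inside this section, if there is one :)
  let $first-h3 := $body[self::h3][1]
  (: material before the first h3, or the whole body when no h3 exists :)
  let $intro := if (exists($first-h3))
                then $body[. << $first-h3]
                else $body
  return
    <section>
      <heading>{ normalize-space($h2) }</heading>
      <content>{ $intro }</content>
      <sub-sections>{
        for tumbling window $sub in $body
            start $h3 when $h3 instance of element(h3)
        return
          <section>
            <heading>{ normalize-space($h3) }</heading>
            <content>{ tail($sub) }</content>
          </section>
      }</sub-sections>
    </section>
}</sections>
```

Because a tumbling window without an end condition runs until the next start item, each $section holds one h2 and everything up to the next h2, and each $sub holds one h3 and everything up to the next h3 within that section. Items of $body that come before the first h3 fall into no inner window, which is why they are collected separately in $intro.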