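xml-conduit: How can I get the first tbody, and only the direct children of that tbody?

Asked: 2014-07-28 16:54:41

Tags: haskell html-parsing xml-conduit

I'm using xml-conduit and Text.XML.Cursor to navigate some horrible HTML with nested tables. There is a table with two tbody tags, and I want the tr tags that are direct children of the first tbody. This is my code so far: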

getIdentityTableBody :: Cursor -> [Cursor]
getIdentityTableBody
  = element "table" >=> hasAttribute "summary" >=>
      attributeIs "summary" "Issuer Identity Information"
      &// element "tbody" >=> child >=> element "tr"

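But this returns the tr descendants of both tbody tags. I can't figure out how to single out the first tbody, and I'm baffled about how to filter down to just its direct children.

Here is the HTML I'm trying to parse.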

<table summary="Issuer Identity Information" width="100%">
  <tbody>
    <tr>
      <th width="33%" class="FormText">CIK (Filer ID Number)</th>
      <th width="10%" class="FormText">Previous Names</th>
      <td width="23%">
        <table border="0" summary="Table with single CheckBox">
          <tbody><tr>
            <td class="CheckBox"><span class="FormData">X</span></td>
            <td align="left" class="FormText">None</td>
          </tr>
        </tbody></table>
      </td>
      <th width="33%" class="FormText">Entity Type</th>
    </tr>
    <tr>
      <td>
        <a href="http://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0001614286">0001614286</a>
      </td>
      <td rowspan="5" colspan="2" valign="top"></td>
      <td rowspan="7" valign="top">
        <table width="100%" border="0" summary="Table with Multiple boxes">
          <tbody><tr>
            <td class="CheckBox">&nbsp;&nbsp;</td>
            <td class="FormText">Corporation</td>
          </tr>
          <tr>
            <td class="CheckBox"><span class="FormData">X</span></td>
            <td class="FormText">Limited Partnership</td>
          </tr>
          <tr>
            <td class="CheckBox">&nbsp;&nbsp;</td>
            <td class="FormText">Limited Liability Company</td>
          </tr>
          <tr>
            <td class="CheckBox">&nbsp;&nbsp;</td>
            <td class="FormText">General Partnership</td>
          </tr>
          <tr>
            <td class="CheckBox">&nbsp;&nbsp;</td>
            <td class="FormText">Business Trust</td>
          </tr>
          <tr>
            <td class="CheckBox">&nbsp;&nbsp;</td>
            <td class="FormText">Other (Specify)</td>
          </tr>
        </tbody></table>
        <br>
      </td>
    </tr>
    <tr>
      <th class="FormText">Name of Issuer</th>
    </tr>
    <tr>
      <td class="FormData">SRA US Equity Fund, LP</td>
    </tr>
    <tr>
      <th class="FormText">Jurisdiction of Incorporation/Organization</th>
    </tr>
    <tr>
      <td class="FormData">DELAWARE</td>
    </tr>
    <tr>
      <th class="FormText" colspan="2">Year of Incorporation/Organization</th>
    </tr>
    <tr>
      <td colspan="3">
        <table border="0" summary="Year of Incorporation/Organization">
          <tbody>
            <tr>
              <td class="CheckBox">&nbsp;&nbsp;</td>
              <td class="FormText">Over Five Years Ago</td>
            </tr>
            <tr>
              <td class="CheckBox"><span class="FormData">X</span></td>
              <td class="FormText">Within Last Five Years (Specify Year)</td>
              <td><span class="FormData">2014</span></td>
            </tr>
            <tr>
              <td class="CheckBox">&nbsp;&nbsp;</td>
              <td class="FormText">Yet to Be Formed</td>
            </tr>
          </tbody>
        </table>
      </td>
    </tr>
  </tbody>
</table>

1 Answer:

Answer 0 (score: 2)

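The problem is that &// element "tbody" says "find every tbody descendant", which includes tbody tags nested inside other tbody tags. How about using &/ instead, so that you get only the tbody elements that are direct children of the table?

Two other comments:

  1. It would be helpful if you could provide some sample XML/HTML.
  2. You don't need both hasAttribute and attributeIs. Checking that the attribute has the given value also checks that it is present.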
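
A minimal sketch of both suggestions, under a few assumptions: the cursor passed in already points at the outer table element, and the input is well-formed XML (the real page contains &nbsp; and an unclosed br, so it would presumably need an HTML-tolerant parser instead). The sample document and the firstTbodyRows helper are hypothetical and only serve as illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Control.Monad ((>=>))
import qualified Data.Text.Lazy as TL
import Text.XML (def, parseText_)
import Text.XML.Cursor

-- Hypothetical, trimmed-down stand-in for the table in the question:
-- an outer tbody with two rows, the first of which nests another table.
sample :: TL.Text
sample = TL.concat
  [ "<table summary=\"Issuer Identity Information\">"
  , "<tbody>"
  , "<tr><td><table summary=\"inner\"><tbody>"
  , "<tr><td>X</td></tr>"
  , "</tbody></table></td></tr>"
  , "<tr><td>DELAWARE</td></tr>"
  , "</tbody>"
  , "</table>"
  ]

-- &/ steps to direct children only, so nested tables are not entered.
-- attributeIs alone also guarantees that the attribute is present.
getIdentityTableBody :: Cursor -> [Cursor]
getIdentityTableBody =
  element "table"
    >=> attributeIs "summary" "Issuer Identity Information"
    &/ element "tbody"
    &/ element "tr"

-- Hypothetical helper: rows of the first direct tbody child only,
-- in case the table really has several tbody children.
firstTbodyRows :: Cursor -> [Cursor]
firstTbodyRows table =
  case table $/ element "tbody" of
    []          -> []
    (tbody : _) -> tbody $/ element "tr"

main :: IO ()
main = do
  let root = fromDocument (parseText_ def sample)
  -- Direct children only: the two rows of the outer tbody.
  print (length (getIdentityTableBody root))                           -- 2
  -- Same result here, because the sample has a single outer tbody.
  print (length (firstTbodyRows root))                                 -- 2
  -- The descendant axis also picks up the row of the nested table.
  print (length (root $// element "tbody" >=> child >=> element "tr")) -- 3
```

The operators &/, &//, $/ and $// all share the fixity of >=>, so the chain above needs no parentheses. If the cursor starts higher up in the document, something like $// getIdentityTableBody would be needed to reach the table first.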