Get a list based on preceding text with lxml

Time: 2016-06-02 18:45:10

Tags: python web-scraping lxml

I have some HTML like this:

... 
    <table width="100%">
            <tr class="blueborder">
              <td colspan="2" class="blackbold">Some Other Text</td>
            </tr>
          </table>
          <table width="100%">     
        <tr class="upcoming">
          <td class="lists" >
            <ul>
              <li> List1 Element1</li>
              <li> List1 Element2</li>
              <li> List1 Element3</li>
            </ul>
          </td>
        </tr>
     </table>
      <table width="100%">
        <tr class="blueborder">
          <td colspan="2" class="blackbold">Signaling Text</td>
        </tr>
      </table>
      <table width="100%">
        <tr class="upcoming">
          <td class="lists" >
            <ul>
              <li> List2 Element1</li>
              <li> List2 Element2</li>
              <li> List2 Element3</li>
            </ul>
          </td>
        </tr>
     </table>   
...

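I am using employees = root.xpath('.//td[@class = "lists"]/ul/li/text()'), but this grabs the elements of both lists. I only want List2, yet the two lists have identical attributes (class, etc.). The only difference is that <td colspan="2" class="blackbold">Signaling Text</td> appears before the list I want. Is there a way to say that I only want the list that comes after it?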

1 answer:

Answer 0 (score: 0)

You can select the td elements that follow the td containing the text Signaling Text:

h = """ <table width="100%">
            <tr class="blueborder">
              <td colspan="2" class="blackbold">Some Other Text</td>
            </tr>
          </table>
          <table width="100%">
        <tr class="upcoming">
          <td class="lists" >
            <ul>
              <li> List1 Element1</li>
              <li> List1 Element2</li>
              <li> List1 Element3</li>
            </ul>
          </td>
        </tr>
     </table>
      <table width="100%">
        <tr class="blueborder">
          <td colspan="2" class="blackbold">Signaling Text</td>
        </tr>
      </table>
      <table width="100%">
        <tr class="upcoming">
          <td class="lists" >
            <ul>
              <li> List2 Element1</li>
              <li> List2 Element2</li>
              <li> List2 Element3</li>
            </ul>
          </td>
        </tr>
     </table>  """

from lxml import html

tree = html.fromstring(h)
# Find the td containing the marker text, then every later td.lists in document order.
print(tree.xpath('//td[contains(.,"Signaling Text")]/following::td[@class = "lists"]/ul/li/text()'))

Which will give you:

[' List2 Element1', ' List2 Element2', ' List2 Element3']
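One caveat: following:: matches every td with class "lists" that comes after the marker cell, not just the nearest one. If the real page has more lists further down, a positional [1] on that step should restrict the match to the first one. The sketch below assumes a hypothetical third list (the "Yet Another Text" and "List3" rows are invented for illustration and are not in the question's HTML) and also strips the leading spaces from the results.

```python
from lxml import html

# Hypothetical sample: a shortened version of the question's markup plus an
# invented third list, added only to show what happens when more lists follow.
SAMPLE = """
<table><tr class="blueborder"><td class="blackbold">Some Other Text</td></tr></table>
<table><tr class="upcoming"><td class="lists"><ul>
  <li> List1 Element1</li><li> List1 Element2</li>
</ul></td></tr></table>
<table><tr class="blueborder"><td class="blackbold">Signaling Text</td></tr></table>
<table><tr class="upcoming"><td class="lists"><ul>
  <li> List2 Element1</li><li> List2 Element2</li>
</ul></td></tr></table>
<table><tr class="blueborder"><td class="blackbold">Yet Another Text</td></tr></table>
<table><tr class="upcoming"><td class="lists"><ul>
  <li> List3 Element1</li>
</ul></td></tr></table>
"""

tree = html.fromstring(SAMPLE)

# Without a position, following:: matches every later td.lists (List2 and List3).
all_following = tree.xpath(
    '//td[contains(., "Signaling Text")]/following::td[@class="lists"]/ul/li/text()'
)

# [1] keeps only the nearest td.lists after the marker cell.
first_following = tree.xpath(
    '//td[contains(., "Signaling Text")]/following::td[@class="lists"][1]/ul/li/text()'
)

# Strip the leading space that each <li> carries in the source markup.
items = [text.strip() for text in first_following]
print(items)
```

Whether this matters depends on the page: if only one list ever follows the marker, the original expression is enough.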

Or, if you are sure it is always the second occurrence:

tree.xpath('(//td[@class = "lists"])[2]/ul/li/text()')
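The parentheses in that expression matter. As a small sketch on a trimmed-down version of the markup (the SAMPLE string and variable names here are just for illustration), this compares the parenthesized form with the unparenthesized one, which applies [2] per parent element instead of to the whole document:

```python
from lxml import html

# Trimmed version of the question's markup: each td.lists sits in its own table.
SAMPLE = """
<table><tr><td class="lists"><ul><li> List1 Element1</li></ul></td></tr></table>
<table><tr><td class="lists"><ul><li> List2 Element1</li></ul></td></tr></table>
"""

tree = html.fromstring(SAMPLE)

# Parenthesized: collect all td.lists in document order, then take the second.
grouped = tree.xpath('(//td[@class="lists"])[2]/ul/li/text()')

# Unparenthesized: [2] is evaluated per parent, i.e. "the second td.lists
# inside the same tr". No tr here has two, so nothing matches.
per_parent = tree.xpath('//td[@class="lists"][2]/ul/li/text()')

print(grouped)
print(per_parent)
```

So without the parentheses the query silently returns an empty list on markup shaped like this.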