Question

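I'm crawling Wikipedia pages for cities and need to extract the country each city belongs to. My approach is to find the <th> that contains the word "country", go back up to its <tr>, and then find the <td> inside it, but the markup comes in several variants.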

(My code works for the first case.)

a = doc.xpath("//table[contains(@class, 'infobox')]")
b = a[0].xpath("//table//th[contains(text(),'Country') or contains(text(),'country')]")
country = b[0].xpath("./../td//a//text()")[0].replace(" ", "_")

I know why it doesn't work for the other cases, but I don't know how to fix it.

The keyword "country" is directly in the <th>

<tr class="mergedtoprow">
      <th scope="row">Country</th>
      <td>
        <a href="/wiki/Poland" title="Poland">Poland</a>
      </td>
</tr>

The keyword "country" is in an <a> inside a <span> inside the <th>

    <tr class="mergedrow">
      <th scope="row">
       <span class="nowrap">
        <a href="/wiki/Countries_of_the_United_Kingdom" title="Countries of the 
         United Kingdom">Constituent country
        </a>
       </span>
      </th>
      <td>
       <span class="flagicon"><img alt="" src="SRC (never mind)" width="23" 
       height="14" class="thumbborder" srcset="SRC (never mind)" />&#160;
       </span>
       <a href="/wiki/England" title="England">England</a>
      </td>
    </tr>

The keyword "country" is in an <a> inside the <th>

Answer 1

You can use the XPath below to match the required th element in all of the cases mentioned:

//th[matches(normalize-space(), "country", "i")]

Note that the "i" flag makes the match case-insensitive, so both "Country" and "country" will match.
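matches() is an XPath 2.0 function. Assuming the `doc` in the question is an lxml tree (lxml implements XPath 1.0), a similar case-insensitive test can be sketched with the EXSLT regular-expression extension that lxml ships. The sample markup and the `EXSLT_RE` and `headers` names below are made up for illustration:

```python
from lxml import html

# Namespace prefix for the EXSLT regular-expression functions supported by lxml.
EXSLT_RE = {"re": "http://exslt.org/regular-expressions"}

# Hypothetical sample: one nested header, one upper-case header, one unrelated header.
doc = html.document_fromstring("""
<html><body><table class="infobox">
  <tr>
    <th scope="row"><span class="nowrap">
      <a href="/wiki/Countries_of_the_United_Kingdom">Constituent country</a>
    </span></th>
    <td><a href="/wiki/England" title="England">England</a></td>
  </tr>
  <tr><th scope="row">COUNTRY</th><td>n/a</td></tr>
  <tr><th scope="row">Population</th><td>n/a</td></tr>
</table></body></html>
""")

# re:test(string, pattern, flags) plays the role of matches() here;
# normalize-space(.) is the whole text of the <th>, nested elements included.
headers = doc.xpath(
    "//th[re:test(normalize-space(.), 'country', 'i')]",
    namespaces=EXSLT_RE,
)

for th in headers:
    print(th.text_content().strip())
```

The pattern is applied to the full string value of each th, so it does not matter how deeply the word is nested.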

If your tool only supports XPath 1.0, you can use

//th[contains(.,'Country') or contains(.,'country')]
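The key difference from the question's expression is `.` instead of `text()`: `text()` only sees text nodes that are direct children of the th, while `.` is the string value of the whole element, including text inside nested span and a elements. A minimal end-to-end sketch with lxml, assuming that is the library behind `doc.xpath`; `wrap` and `extract_country` are hypothetical helpers and the rows are trimmed copies of the question's samples:

```python
from lxml import html


def wrap(row_html):
    """Wrap one <tr> in a minimal infobox table so the sample parses as a page."""
    return f'<html><body><table class="infobox">{row_html}</table></body></html>'


POLAND = wrap("""
<tr class="mergedtoprow">
  <th scope="row">Country</th>
  <td><a href="/wiki/Poland" title="Poland">Poland</a></td>
</tr>""")

ENGLAND = wrap("""
<tr class="mergedrow">
  <th scope="row"><span class="nowrap">
    <a href="/wiki/Countries_of_the_United_Kingdom">Constituent country</a>
  </span></th>
  <td>
    <span class="flagicon"><img alt="" src="flag.png" />&#160;</span>
    <a href="/wiki/England" title="England">England</a>
  </td>
</tr>""")


def extract_country(page_html):
    """Return the first linked country name from the infobox, or None."""
    doc = html.document_fromstring(page_html)
    # contains(., ...) tests the full string value of the <th>,
    # so the keyword may sit in nested <span>/<a> elements.
    headers = doc.xpath(
        "//table[contains(@class, 'infobox')]"
        "//th[contains(., 'Country') or contains(., 'country')]"
    )
    if not headers:
        return None
    # Relative path: stay in the same row and look inside its <td>.
    names = headers[0].xpath("./following-sibling::td//a/text()")
    if not names:
        return None
    return names[0].strip().replace(" ", "_")


print(extract_country(POLAND))
print(extract_country(ENGLAND))
```

Returning None when no header or link is found avoids the IndexError that `b[0]` would raise on pages without a matching row.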
