Question

I am trying to get all headings that are not inside the footer.

So the heading <h3 class="ibm-bold">Discover</h3> should be excluded from the scrape.

<footer role="contentinfo" aria-label="IBM">
   <div class="region region-footer">
   <div id="ibm-footer-module">
    <section role="region" aria-label="Resources">
            <h3 class="ibm-bold">Discover</h3>

I tried to select the headings that should be excluded with this expression, but it does not return the right nodes.

//*[self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6]/ancestor::footer/text()

The page I want to scrape is this one: https://www.ibm.com/products/informix/embedded-for-iot?mhq=iot&mhsrc=ibmsearch_a

Please help.

Answer 1

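You almost had it. Keep the heading test and add the footer check to the same predicate with not(ancestor::footer):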

//*[
  (self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6)
  and not(ancestor::footer)
]/text()
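To try the expression without hitting the live page, here is a minimal sketch that runs it with lxml against a small inline document. The SAMPLE markup and the heading texts in it are made up for illustration and only mimic the structure from the question; in a Scrapy spider the same XPath string would be passed to response.xpath(...).

```python
from lxml import html

# Hypothetical sample markup: two headings in the main content, one in the footer.
SAMPLE = """
<html><body>
  <main>
    <h1>Informix Embedded</h1>
    <section><h2>Features</h2></section>
  </main>
  <footer role="contentinfo" aria-label="IBM">
    <div class="region region-footer">
      <section role="region" aria-label="Resources">
        <h3 class="ibm-bold">Discover</h3>
      </section>
    </div>
  </footer>
</body></html>
"""

# Same predicate as above: any h1..h6 that has no <footer> ancestor.
HEADINGS_OUTSIDE_FOOTER = (
    "//*[(self::h1 or self::h2 or self::h3 or self::h4 or self::h5 or self::h6)"
    " and not(ancestor::footer)]/text()"
)

tree = html.fromstring(SAMPLE)
headings = [str(text) for text in tree.xpath(HEADINGS_OUTSIDE_FOOTER)]
print(headings)
```

The expression from the question walks from each heading up to its footer ancestor and then asks for the footer's own text nodes, which is why it did not return heading text.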

Answer 2

You can use the following snippet to remove the footer tag with BeautifulSoup:

from urllib.request import urlopen  # Python 3; in Python 2 this was `from urllib import urlopen`
from bs4 import BeautifulSoup

url = "https://www.ibm.com/products/informix/embedded-for-iot?mhq=iot&mhsrc=ibmsearch_a"
url_open = urlopen(url)
soup = BeautifulSoup(url_open, "html.parser")
for s in soup("footer"):  # soup("footer") is shorthand for soup.find_all("footer")
    s.extract()  # detach each footer tag from the tree
print(soup)  # the HTML source is printed without the footer tag
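Removing the footer only gets you halfway, since the question asks for the headings themselves. A rough sketch of the remaining step, run on an inline document so it does not depend on the network; SAMPLE and the helper name headings_outside_footer are made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup that mimics the structure from the question.
SAMPLE = """
<html><body>
  <h1>Informix Embedded</h1>
  <section><h2>Features</h2></section>
  <footer role="contentinfo" aria-label="IBM">
    <section role="region" aria-label="Resources">
      <h3 class="ibm-bold">Discover</h3>
    </section>
  </footer>
</body></html>
"""

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def headings_outside_footer(markup):
    """Return the text of every h1..h6 that is not inside a <footer>."""
    soup = BeautifulSoup(markup, "html.parser")
    # Drop every footer subtree first, so its headings can no longer match.
    for footer in soup.find_all("footer"):
        footer.decompose()
    # find_all accepts a list of tag names and returns matches in document order.
    return [tag.get_text(strip=True) for tag in soup.find_all(HEADING_TAGS)]


print(headings_outside_footer(SAMPLE))
```

Note that this modifies the parsed tree, so if you also need the footer content later, parse the page a second time or use the XPath approach from the other answer.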

Scrapy: XPath to select all headings whose ancestor is not a footer

2 answers: