Question

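I'm scraping a series of very flat web pages. The only structure that matters to me is that I want every element that appears after an h2 element with a known id. The elements I want after this h2 are p, blockquote, and center. Order matters and must be preserved when I locate them. I should add that all the elements of interest are siblings: they sit at the same level of the tree, one right after another. How can I do this?

Here is what I tried: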

soup = BeautifulSoup(response)
# here is the title
h = soup.find("h2", {"id": "content"})
print(h.text) # correct, so we're in the right place
print(h.next_sibling)

But the final print statement only prints None. I also tried this:

i = h.next
print(i.text)

But that raises an AttributeError on a NavigableString:

Traceback (most recent call last):
  File "scrape.py", line 15, in <module>
    print(i.text)
  File "/usr/lib/python2.7/dist-packages/BeautifulSoup.py", line 473, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (self.__class__.__name__, attr)
AttributeError: 'NavigableString' object has no attribute 'text'

The elements I'm looking for are definitely at the same level as this h2 element and come later in the HTML. How do I find them with BS navigation?

Answer 1

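When you call h.next_sibling, BeautifulSoup returns the next node at the same level. That node can be either a tag or a standalone string. My guess is that your HTML document has some standalone text sitting before the tags you are looking for.

Example: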

from bs4 import BeautifulSoup  # assumes the bs4 package is installed

html = '<h1>A header</h1>Some random text<p>A paragraph</p>'
soup = BeautifulSoup(html, "html.parser")
h = soup.find('h1')  # <h1>A header</h1>
print(h.next_sibling)  # prints the string "Some random text", not the <p> tag
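
To skip those stray strings and keep only the tags the question asks for, one option is find_next_siblings, which returns matching siblings in document order. The sketch below assumes the bs4 package; the sample HTML is made up for illustration. The traceback in the question points at the older BeautifulSoup 3 module, where the equivalent names are spelled nextSibling and findNextSiblings, so the names may need adjusting there.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up sample page: the h2 is followed by stray text and sibling tags.
html = (
    '<div>'
    '<h2 id="content">Title</h2>'
    'stray text'
    '<p>first</p>'
    '<blockquote>second</blockquote>'
    '<div>ignored</div>'
    '<center>third</center>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
h = soup.find("h2", {"id": "content"})

wanted = ["p", "blockquote", "center"]

# Option 1: let BeautifulSoup filter the later siblings by tag name.
following = h.find_next_siblings(wanted)
names = [tag.name for tag in following]
texts = [tag.get_text() for tag in following]

# Option 2: walk every later sibling and drop strings and unwanted tags by hand.
manual = [s for s in h.next_siblings
          if isinstance(s, Tag) and s.name in wanted]

for tag in following:
    print(tag.name, tag.get_text())
```

Both options preserve the order in which the elements appear in the HTML, so the second is only worth it if you also need to look at the text nodes in between.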
