Question

I want to loop with BeautifulSoup over all <a href=...> elements that are inside an <h2>, which is itself inside a <div class="myclass">:

<a href="www.example.com">Not selected</a> 
<div class="myclass">
  <a href="www.example.com">Not selected</a> 
  <h2>
    <a href="www.example.com">SELECTED!</a> 
  </h2>
</div>

I was thinking about something like this, but I can imagine that BeautifulSoup can do such filtering without any if link.parent == ... tests:

import urllib2
from bs4 import BeautifulSoup

# req: a URL string or urllib2.Request defined elsewhere
soup = BeautifulSoup(urllib2.urlopen(req), "lxml")

for link in soup.select('a[href]'):
    if link.parent == ...   # tests
        print(link)

How to do this with BeautifulSoup?

Answer 1

You can step down with findAll, one level at a time, until you reach the a elements you want:

for div in soup.findAll("div", attrs={"class": "myclass"}):
    for h2 in div.findAll("h2"):
        for a in h2.findAll("a"):
            print(a)

Or you can use a CSS selector with select:

soup.select('.myclass h2 a')
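
As a quick check that the two approaches agree, here is a minimal sketch run against the markup from the question. The html string and the built-in "html.parser" (used here instead of lxml to avoid an extra dependency) are assumptions made to keep the example self-contained.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed input for illustration).
html = """
<a href="www.example.com">Not selected</a>
<div class="myclass">
  <a href="www.example.com">Not selected</a>
  <h2>
    <a href="www.example.com">SELECTED!</a>
  </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Approach 1: nested find_all calls (findAll is the older alias of find_all).
nested = [
    a
    for div in soup.find_all("div", attrs={"class": "myclass"})
    for h2 in div.find_all("h2")
    for a in h2.find_all("a")
]

# Approach 2: a single CSS selector using descendant combinators.
selected = soup.select(".myclass h2 a")

for a in selected:
    print(a.get_text(strip=True))
```

On this markup both lists hold the same single link. The selector version is shorter and is usually the easier one to adjust later.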

Answer 2

Use a CSS selector. Naming the class on the div keeps the match restricted to <div class="myclass">:

soup.select('div.myclass h2 a')
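
To see why the class qualifier matters, the sketch below compares a selector that names only the tag with one that also names the class. The markup, including the second div with class "other", is hypothetical and only there to show the difference.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs with the same inner structure.
html = """
<div class="myclass"><h2><a href="www.example.com">wanted</a></h2></div>
<div class="other"><h2><a href="www.example.com">unwanted</a></h2></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches links under an h2 inside any div, regardless of class.
any_div = [a.get_text() for a in soup.select("div h2 a")]

# Matches only links under an h2 inside a div with class "myclass".
only_myclass = [a.get_text() for a in soup.select("div.myclass h2 a")]

print(any_div)
print(only_myclass)
```

If the page has only one kind of div, both selectors return the same links, but the class-qualified form is safer against unrelated markup.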

Answer 3

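Beautiful Soup supports CSS selectors, including class selectors (see the CSS selectors section of its documentation).

So you can query like this, using select (find_all does not accept CSS selectors):

soup.select('.myclass > h2 > a')

This returns every anchor tag that is a direct child of an h2 which is itself a direct child of an element with class myclass.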
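
The child combinator (>) is stricter than the descendant combinator (a space). The following is a minimal sketch of the difference, passing the selector to select(), which is the method that interprets CSS selectors. The markup with the extra <section> wrapper is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one h2 directly under the div, one wrapped in a section.
html = """
<div class="myclass">
  <h2><a href="www.example.com">direct</a></h2>
  <section>
    <h2><a href="www.example.com">nested</a></h2>
  </section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" requires each element to be an immediate child of the previous one.
child_only = [a.get_text() for a in soup.select(".myclass > h2 > a")]

# A space allows any depth of nesting between the elements.
any_depth = [a.get_text() for a in soup.select(".myclass h2 a")]

print(child_only)
print(any_depth)
```

For the markup in the question both forms match the same link, so the choice only matters when the h2 may be wrapped in intermediate elements.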

Answer 4

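You can do it like this, finding each div first and then running a CSS selector inside it:

divs = soup.findAll('div', {'class': 'myclass'})
for div in divs:
    # select, not findAll, understands CSS selectors such as 'h2 > a'
    links = div.select('h2 > a')
    for link in links:
        print(link)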
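
A minimal sketch of this mixed approach on the question's markup is shown below. It also shows that find_all() reads a string argument as a tag name, so CSS selector syntax belongs in select(). The html string and the "html.parser" choice are assumptions made for a self-contained example.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question (assumed input for illustration).
html = """
<a href="www.example.com">Not selected</a>
<div class="myclass">
  <a href="www.example.com">Not selected</a>
  <h2>
    <a href="www.example.com">SELECTED!</a>
  </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for div in soup.find_all("div", {"class": "myclass"}):
    # select() called on a Tag only searches inside that tag.
    links.extend(div.select("h2 > a"))

# find_all() looks for a tag literally named "h2 > a", which does not exist.
as_tag_name = soup.find_all("h2 > a")

print([a.get_text(strip=True) for a in links])
print(len(as_tag_name))
```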
