All <a> children of a certain <div>

Time: 2016-02-23 10:40:18

Tags: python parsing beautifulsoup

I want to loop with BeautifulSoup over all <a href=...> elements that sit inside an <h2>, which is itself inside a <div class="myclass">:

<a href="www.example.com">Not selected</a> 
<div class="myclass">
  <a href="www.example.com">Not selected</a> 
  <h2>
    <a href="www.example.com">SELECTED!</a> 
  </h2>
</div>

I was thinking of something like this, but I imagine BeautifulSoup can do this filtering without any if link.parent == ... tests:

import urllib2

from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen(req), "lxml")

for link in soup.select('a[href]'):
    if link.parent == ...   # tests
        print(link)

How can I do this with BeautifulSoup?

4 answers:

Answer 0 (score: 2)

You can walk down with findAll, one level at a time, until you reach the a tags you want:

# Narrow the search one level at a time: div.myclass, then h2, then a
for div in soup.findAll("div", attrs={"class": "myclass"}):
    for h2 in div.findAll("h2"):
        for a in h2.findAll("a"):
            print(a)

Or you can use a CSS selector with select:

soup.select('.myclass h2 a')
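
As a minimal sketch, both approaches can be checked against the markup from the question. It assumes bs4 is installed and uses Python's built-in "html.parser" instead of lxml; the variable names html, nested and selected are illustrative only.

```python
from bs4 import BeautifulSoup

# Markup copied from the question
html = """
<a href="www.example.com">Not selected</a>
<div class="myclass">
  <a href="www.example.com">Not selected</a>
  <h2>
    <a href="www.example.com">SELECTED!</a>
  </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Step by step: div.myclass -> h2 -> a
nested = [
    a
    for div in soup.find_all("div", attrs={"class": "myclass"})
    for h2 in div.find_all("h2")
    for a in h2.find_all("a")
]

# The same filtering expressed as a single CSS selector
selected = soup.select(".myclass h2 a")

print([a.get_text() for a in nested])    # ['SELECTED!']
print([a.get_text() for a in selected])  # ['SELECTED!']
```

Both forms skip the two links that sit outside the h2.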

Answer 1 (score: 1)

Use a CSS selector:

soup.select('div h2 a')
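
Note that this selector matches links under any div, not only those with class myclass. A small sketch of the difference, assuming bs4 is installed; the second div with class "other" is a hypothetical addition that is not in the question's markup:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one div without the target class, one with it
html = """
<div class="other"><h2><a href="www.example.com">other div</a></h2></div>
<div class="myclass"><h2><a href="www.example.com">SELECTED!</a></h2></div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches links inside an h2 inside any div
any_div = [a.get_text() for a in soup.select("div h2 a")]

# Restricts the match to divs carrying class "myclass"
only_myclass = [a.get_text() for a in soup.select("div.myclass h2 a")]

print(any_div)       # ['other div', 'SELECTED!']
print(only_myclass)  # ['SELECTED!']
```

If the page contains only the one div, the two selectors return the same links.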

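Answer 2 (score: 0)

Beautiful Soup supports CSS class selectors through select (see the relevant documentation).

So you can write the query like this:

soup.select('.myclass > h2 > a')

This matches every anchor tag that is a direct child of an h2 that is itself a direct child of the div with class myclass.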
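
In a selector passed to select, the > combinator matches direct children only, while a space matches descendants at any depth. A short sketch of the difference, assuming bs4 is installed; the span wrapper in the second h2 is hypothetical and only there to show the contrast:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link directly in an h2, one wrapped in a span
html = """
<div class="myclass">
  <h2><a href="www.example.com">direct child</a></h2>
  <h2><span><a href="www.example.com">nested deeper</a></span></h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" requires each element to be a direct child of the previous one
children = [a.get_text() for a in soup.select(".myclass > h2 > a")]

# A space allows any number of intermediate elements
descendants = [a.get_text() for a in soup.select(".myclass h2 a")]

print(children)     # ['direct child']
print(descendants)  # ['direct child', 'nested deeper']
```

For the markup in the question both forms return the same link, because the a is a direct child of the h2 and the h2 is a direct child of the div.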

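Answer 3 (score: 0)

You can do it like this:

divs = soup.findAll('div', {'class': 'myclass'})
for div in divs:
    # findAll only matches tag names; CSS selectors such as 'h2 > a' need select
    links = div.select('h2 > a')
    for link in links:
        print(link)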
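
A possible variation on this loop, sketched under the assumption that bs4 is installed: calling select on each div scopes the selector to that div, and the href attribute can be read with dictionary-style access. The list name hrefs is illustrative.

```python
from bs4 import BeautifulSoup

# Markup copied from the question
html = """
<a href="www.example.com">Not selected</a>
<div class="myclass">
  <a href="www.example.com">Not selected</a>
  <h2>
    <a href="www.example.com">SELECTED!</a>
  </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for div in soup.find_all("div", class_="myclass"):
    # The selector is evaluated relative to this div only
    for link in div.select("h2 > a[href]"):
        hrefs.append(link["href"])

print(hrefs)  # ['www.example.com']
```

The [href] part of the selector skips anchors without an href attribute, so link["href"] cannot raise a KeyError here.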