python BeautifulSoup: get all hrefs in the children of a div

Time: 2016-03-19 21:24:26

Tags: python beautifulsoup

I am new to Python, and I have been trying to get the links and inner text from this HTML code:

<div class="someclass">
  <ul class="listing">
        <li>
          <a href="http://link1.com" title="">title1</a>
                </li>
        <li>
           <a href="http://link2.com" title="">title2</a>
                 </li>
        <li>
           <a href="http://link3.com" title="">title3</a>
                 </li>
        <li>
           <a href="http://link4.com" title="">title4</a>
                  </li>
  </ul>
</div>

I want all the links from the href attributes (http://link1.com and so on) along with the inner text (title1 and so on).

I tried this code:

div = soup.find_all('ul',{'class':'listing'})
for li in div:
    all_li = li.find_all('li')
    for link in all_li.find_all('a'):
        print(link.get('href'))

but had no luck. Can anyone help me?

3 answers:

Answer 0: (score: 3)

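The problem is that find_all returns a list, so in your second for loop you are calling find_all on a list instead of on a single tag. Loop over each li and use find() to get its a tag: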

>>> for ul in soup.find_all('ul', class_='listing'):
...     for li in ul.find_all('li'):
...         a = li.find('a')
...         print(a['href'], a.get_text())
... 
http://link1.com title1
http://link2.com title2
http://link3.com title3
http://link4.com title4

You can also use a CSS selector instead of nested for loops:

>>> for a in soup.select('.listing li a'):
...     print(a['href'], a.get_text(strip=True))
... 
http://link1.com title1
http://link2.com title2
http://link3.com title3
http://link4.com title4
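For anyone who wants to run the selector approach outside an interactive session, here is a minimal self-contained sketch. It assumes the HTML from the question, uses the built-in "html.parser" backend (an assumption made only to avoid an extra dependency), and introduces a hypothetical variable name, pairs, for the collected results.

```python
from bs4 import BeautifulSoup

# HTML from the question, condensed onto fewer lines.
html = """
<div class="someclass">
  <ul class="listing">
    <li><a href="http://link1.com" title="">title1</a></li>
    <li><a href="http://link2.com" title="">title2</a></li>
    <li><a href="http://link3.com" title="">title3</a></li>
    <li><a href="http://link4.com" title="">title4</a></li>
  </ul>
</div>
"""

# "html.parser" ships with Python; it is an assumption here, not a requirement.
soup = BeautifulSoup(html, "html.parser")

# One CSS selector replaces the nested loops: every <a> with an href
# that sits inside an <li> inside <ul class="listing">.
pairs = [
    (a["href"], a.get_text(strip=True))
    for a in soup.select("ul.listing li a[href]")
]

for href, text in pairs:
    print(href, text)
```

Collecting (href, text) tuples rather than printing directly keeps each URL attached to its own title, which is handy if the results are processed further.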

Answer 1: (score: 2)

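Select the ul, then get all the a tags inside it: take the href from each a that has an href attribute, and the text from each a that has a title attribute.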

from bs4 import BeautifulSoup

html = """<div class="someclass">
  <ul class="listing">
        <li>
          <a href="http://link1.com" title="">title1</a>
                </li>
        <li>
           <a href="http://link2.com" title="">title2</a>
                 </li>
        <li>
           <a href="http://link3.com" title="">title3</a>
                 </li>
        <li>
           <a href="http://link4.com" title="">title4</a>
                  </li>
  </ul>
</div>"""

soup = BeautifulSoup(html,"lxml")
ul = soup.select("ul.listing")[0]
links = [a["href"] for a in ul.select("a[href]")]
text = [a.text for a in ul.select("a[title]")]

Which will give you:

['title1', 'title2', 'title3', 'title4']
['http://link1.com', 'http://link2.com', 'http://link3.com', 'http://link4.com']

If you do have multiple uls matching that class:

uls = soup.select("ul.listing")
links = [a["href"] for ul in uls for a in ul.select("a[href]")]
text = [a.text for ul in uls for a in ul.select("a[title]")]

print(text)
print(links)
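One thing to watch with the two-list approach: the lists come from different selectors (a[href] and a[title]), so they can end up with different lengths if some anchor has one attribute but not the other. The sketch below is one possible way to keep each URL paired with its own text in a single pass. The sample HTML, including the anchor without an href, is hypothetical, and "html.parser" is used only as an assumption to avoid requiring lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two matching lists, and one anchor that has no href.
html = """
<ul class="listing">
  <li><a href="http://link1.com" title="">title1</a></li>
  <li><a title="">no link here</a></li>
</ul>
<ul class="listing">
  <li><a href="http://link2.com" title="">title2</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
uls = soup.select("ul.listing")

# Two separate selectors: the lists can drift out of step.
links = [a["href"] for ul in uls for a in ul.select("a[href]")]
text = [a.text for ul in uls for a in ul.select("a[title]")]
print(len(links), len(text))

# Single pass over anchors that have an href: each URL stays with its text.
pairs = [
    (a["href"], a.get_text(strip=True))
    for ul in uls
    for a in ul.select("a[href]")
]
print(pairs)
```

With the HTML from the question every anchor has both attributes, so the two lists line up and the original approach works as shown.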

Answer 2: (score: 1)

In your code, all_li is actually a list of li elements. On the next line, you try to use it as if it were a single element:

all_li.find_all('a')

Instead, you need to iterate over the elements of all_li and call find_all on each one.

Something like this should work:

uls = soup.find_all('ul', {'class': 'listing'})
for ul in uls:
    for li in ul.find_all('li'):
        for link in li.find_all('a'):
            url = link.get('href')
            contents = link.text
            print(url, contents)

This prints the href and the text of each link.
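To see the failure from the question in isolation, the following sketch calls find_all on the result of find_all and catches the resulting AttributeError, then does the per-element iteration instead. The one-item HTML string and the names failed and links are hypothetical, chosen only for illustration; "html.parser" is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item list, just enough to trigger the behaviour.
html = '<ul class="listing"><li><a href="http://link1.com">title1</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

all_li = soup.find_all("li")  # list-like result, not a single tag

try:
    all_li.find_all("a")      # the call from the question
    failed = False
except AttributeError:
    # A list of results has no find_all method of its own.
    failed = True

# Calling find_all on each element of the list works.
links = [a.get("href") for li in all_li for a in li.find_all("a")]
print(failed, links)
```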