How do I get the link titles from an HTML page using BeautifulSoup in Python?

Asked: 2016-09-05 20:00:45

Tags: python web-scraping beautifulsoup urllib2

Consider the following HTML:

<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>

I want to extract the following titles and save them in a list:

Student Notice
UPDATED BUS ROUTE
Application are invited

How can I do this using urllib2 and BeautifulSoup?

2 answers:

Answer 0 (score: 0)

Which versions of Python and BeautifulSoup are you using? With Python 2.x and BeautifulSoup 3:

from BeautifulSoup import BeautifulSoup

text = """<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""

soup = BeautifulSoup(text)

for link in soup.findAll('a'):
    print link.contents[0]

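BeautifulSoup 4 is slightly different: the package is imported from bs4, the parser is named explicitly, and findAll becomes find_all:

from bs4 import BeautifulSoup

soup = BeautifulSoup(text, 'html.parser')

for link in soup.find_all('a'):
    # contents[0] is the first child of the <a> tag, here its text
    print(link.contents[0])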
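
Since the question asks for the titles in a list rather than printed, here is a minimal sketch of that step, assuming BeautifulSoup 4. The function name extract_link_titles and the shortened HTML sample are illustrative and not part of the original post; get_text(strip=True) is used to drop the surrounding whitespace seen in " Student Notice ".

```python
from bs4 import BeautifulSoup

# Shortened version of the HTML from the question (img tags omitted for brevity).
html = """<li><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" target="_blank">UPDATED BUS ROUTE </a></li>
<li><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""


def extract_link_titles(markup):
    """Return the stripped text of every <a> tag in the markup (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    return [link.get_text(strip=True) for link in soup.find_all("a")]


titles = extract_link_titles(html)
print(titles)
```

With the sample above this should give a list of the three titles with the leading and trailing spaces removed.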

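Answer 1 (score: 0)

If you already have the HTML, you don't need urllib. urllib is for sending a request to a web server, which then returns the HTML. With the HTML in hand you can simply do this: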

>>> from bs4 import BeautifulSoup
>>> a = """<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
... <li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
... <li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""
>>> b = BeautifulSoup(a, 'html.parser')
>>> c = b.find_all('li')

>>> for elem in c:
...     print(elem.a.string)
... 
 Student Notice 
UPDATED BUS ROUTE 
Application are invited
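
If the HTML does have to be downloaded first, the fetch step could be sketched as follows. This assumes Python 3, where urllib2.urlopen from Python 2 lives at urllib.request.urlopen, and BeautifulSoup 4. The function names and the URL in the usage comment are hypothetical. Items without a link are skipped, and get_text(strip=True) is used instead of .string so that whitespace is trimmed and an anchor with nested tags does not yield None.

```python
from urllib.request import urlopen  # Python 3; in Python 2 this was urllib2.urlopen

from bs4 import BeautifulSoup


def titles_from_html(markup):
    """Collect the link text of every <li> that contains an <a> tag."""
    soup = BeautifulSoup(markup, "html.parser")
    titles = []
    for item in soup.find_all("li"):
        if item.a is not None:  # skip list items that have no link
            titles.append(item.a.get_text(strip=True))
    return titles


def titles_from_url(url):
    """Download the page at url and extract its link titles."""
    with urlopen(url) as response:
        markup = response.read()  # bytes; BeautifulSoup accepts bytes or str
    return titles_from_html(markup)


# Usage, with a placeholder address:
# print(titles_from_url("http://example.com/notices.html"))
```

Keeping the parsing separate from the download means the same function works whether the HTML comes from a string, a file, or a web request.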