Consider the following HTML:
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>
I want to extract the following titles and store them in a list:
Student Notice
UPDATED BUS ROUTE
Application are invited
How can I do this with urllib2 and BeautifulSoup?
Answer 0 (score: 0)
Which versions of Python and BeautifulSoup are you using? With Python 2.x and BeautifulSoup 3:
from BeautifulSoup import BeautifulSoup
text = """<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""
soup = BeautifulSoup(text)
for link in soup.findAll('a'):
    print link.contents[0]
BeautifulSoup 4 is slightly different:
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
for link in soup.find_all('a'):
    print(link.contents[0])
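Since the question asks for the titles in a list rather than printed output, the loop above can be folded into a list comprehension. This is only a minimal sketch, assuming BeautifulSoup 4 is installed; the variable names `html` and `titles` are illustrative, the markup is a shortened copy of the question's HTML, and `get_text(strip=True)` is used to drop the whitespace around the link text.

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question (the <img> tags are omitted for brevity).
html = """<li><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
<li><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" target="_blank">UPDATED BUS ROUTE </a></li>
<li><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) returns the link text with surrounding whitespace removed.
titles = [link.get_text(strip=True) for link in soup.find_all("a")]
print(titles)
```

With the sample markup this should give `['Student Notice', 'UPDATED BUS ROUTE', 'Application are invited']`.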
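Answer 1 (score: 0)
If you already have the HTML, you don't need urllib.
urllib is for sending a request to a web server, which then returns the HTML. Here you can simply parse the HTML directly: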
>>> from bs4 import BeautifulSoup
>>> a = """<li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>
... <li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Bus_Route_Chart_Aug16.pdf" style="font-size:14px" target="_blank">UPDATED BUS ROUTE </a></li>
... <li><img alt="SangamUniversity-animated" class="newimg" src="images/new_animated.gif" /><a href="pdffiles/Sangam_University_Faculty_Requirement_Aug2016.jpg" target="_blank">Application are invited </a></li>"""
>>> b = BeautifulSoup(a, 'html.parser')
>>> c = b.find_all('li')
>>> for elem in c:
...     print(elem.a.string)
...
Student Notice
UPDATED BUS ROUTE
Application are invited
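If the page does have to be downloaded first, which is what the question's mention of urllib2 suggests, the fetch and the parse can be kept as two separate steps. The following is a sketch under assumptions, not a definitive implementation: it uses Python 3's `urllib.request` (the successor to Python 2's `urllib2`), the helper names `fetch_html` and `extract_titles` are hypothetical, and the page is assumed to be UTF-8 encoded because the question does not say.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Download a page and return its body as text (assumes UTF-8, an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_titles(html):
    """Return the stripped link text of the first <a> inside each <li>."""
    soup = BeautifulSoup(html, "html.parser")
    # li.a is None when an <li> has no link, so such items are skipped.
    return [li.a.get_text(strip=True) for li in soup.find_all("li") if li.a]


# Offline demonstration on a fragment of the question's markup; no request is made.
sample = '<li><a href="pdffiles/Special Exam_Aug_2016.pdf" target="_blank"> Student Notice </a></li>'
print(extract_titles(sample))
```

For a live page the two helpers would be chained as `extract_titles(fetch_html(url))`, where `url` is the address of the page holding the list; the question does not give that address, so it is left out here.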