I have the following HTML.
<div class="media-body"><i class="" style="text-shadow:1px 1px 0px #dcdcdc;">29 May 2016 </i><a href="http://www.sharesansar.com/events/opening-day-of-auction-of-tinau-development-bank-limited-21903-32-units-ordinary-unclaimed-right-share/"><h4 class="media-heading">Opening Day of auction of Tinau Development Bank Limited 21,903.32 units ordinary unclaimed right share.</h4></a><p>Mini Bid Amt: Rs 100 Mini Application: 100 units or multiply by 10 Opening Date: 16th Jestha, 2073 Closing Date: 30th Jestha, 2073 Bid Opening Date: 31st Jestha, 2073 Time: 3:15 PM Contact: Siddhartha Capital Limited, Anamnagar, Kathmandu, 4257767, 4257768</p></div>
I have been trying to retrieve the date, 29 May 2016, with the following code, but it does not work.
import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError

def events_log(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.sharesansar.com/events/2016/06/page/'+str(page)+'/'
        try:
            html = urlopen(url)
        except HTTPError as e:
            print(e)
        else:
            if html is None:
                print("URL is not found")
            else:
                soup = BeautifulSoup(html.read(), 'lxml')
                for name in soup.findAll('i', {'class':''}):
                    print(name.get_text())

events_log(1)
I am a complete newbie and have been trying to solve this since yesterday.
Answer 0 (score: 1)
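Remember to increment the page counter: without `page += 1` the `while` loop never ends. With a simple modification to your code, and leaving out the error checking (that part is up to you), it works: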
import requests
from bs4 import BeautifulSoup

def events_log(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.sharesansar.com/events/2016/06/page/'+str(page)+'/'
        res = requests.get(url)
        soup = BeautifulSoup(res.text, 'lxml')
        for name in soup.findAll('i', {'class':''}):
            print(name.get_text())
        page += 1  # move on to the next page so the loop can terminate

events_log(1)
Output:
30 Jun 2016
30 Jun 2016
29 Jun 2016
29 Jun 2016
28 Jun 2016
28 Jun 2016
26 Jun 2016
24 Jun 2016
24 Jun 2016
22 Jun 2016
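If you want the error checking back, here is one possible sketch rather than a definitive version. The helper names `fetch_html` and `extract_dates` and the constant `BASE_URL` are my own and do not come from the question. It assumes every event on the listing page sits in a `div.media-body` like the sample HTML, so it takes the first `<i>` inside each of those instead of matching on the empty class. It also uses the built-in `'html.parser'` backend in place of `'lxml'`, purely to avoid the extra dependency.

```python
import requests
from bs4 import BeautifulSoup

# Listing URL from the question, with a placeholder for the page number.
BASE_URL = 'http://www.sharesansar.com/events/2016/06/page/{}/'


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        res = requests.get(url, timeout=timeout)
        res.raise_for_status()  # turn 4xx/5xx responses into exceptions
    except requests.RequestException as e:
        print('Skipping {}: {}'.format(url, e))
        return None
    return res.text


def extract_dates(html):
    """Collect the text of the first <i> inside each div.media-body.

    Assumption: each event is wrapped in <div class="media-body">,
    as in the sample HTML from the question.
    """
    soup = BeautifulSoup(html, 'html.parser')
    dates = []
    for body in soup.find_all('div', class_='media-body'):
        tag = body.find('i')
        if tag is not None:
            dates.append(tag.get_text(strip=True))
    return dates


def events_log(max_pages):
    # range() advances the page number, so there is no counter to forget.
    for page in range(1, max_pages + 1):
        html = fetch_html(BASE_URL.format(page))
        if html is not None:
            for date in extract_dates(html):
                print(date)
```

Calling `events_log(1)` should print the dates from the first page, provided the site still serves the same markup. `extract_dates` can also be tried offline by passing it the HTML snippet from the question as a string.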