Question

So I have the following HTML.

<div class="media-body"><i class="" style="text-shadow:1px 1px 0px #dcdcdc;">29 May 2016 </i><a href="http://www.sharesansar.com/events/opening-day-of-auction-of-tinau-development-bank-limited-21903-32-units-ordinary-unclaimed-right-share/"><h4 class="media-heading">Opening Day of auction of Tinau Development Bank Limited 21,903.32 units ordinary unclaimed right share.</h4></a><p>Mini Bid Amt: Rs 100 Mini Application: 100 units or multiply by 10 Opening Date: 16th Jestha, 2073 Closing Date: 30th Jestha, 2073 Bid Opening Date: 31st Jestha, 2073 Time: 3:15 PM Contact: Siddhartha Capital Limited, Anamnagar, Kathmandu, 4257767, 4257768</p></div>

I have been trying to retrieve the date, 29 May 2016, with the following code, but it is not working.

import requests
from bs4 import BeautifulSoup
from urllib.request import urlopen
from urllib.error import HTTPError
def events_log(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.sharesansar.com/events/2016/06/page/'+str(page)+'/'
        try:
            html = urlopen(url)
        except HTTPError as e:
            print(e)
        else:
            if html is None:
                print ("URL is not found")
            else:
                soup = BeautifulSoup(html.read(), 'lxml')
                for name in soup.findAll('i', {'class':''}):
                    print(name.get_text())
events_log(1)

I am a complete noob and have been trying to solve this since yesterday.

Answer 1

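Remember to increment the page counter; without `page += 1`, the condition `page <= max_pages` stays true and the loop never ends. With a simple modification of your code, and without error checking (that part is up to you), it works: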

import requests
from bs4 import BeautifulSoup

def events_log(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.sharesansar.com/events/2016/06/page/'+str(page)+'/'

        res = requests.get(url)

        soup = BeautifulSoup(res.text, 'lxml')
        for name in soup.findAll('i', {'class':''}):
            print(name.get_text())
        page += 1

events_log(1)

Output:

30 Jun 2016 
30 Jun 2016 
29 Jun 2016 
29 Jun 2016 
28 Jun 2016 
28 Jun 2016 
26 Jun 2016 
24 Jun 2016 
24 Jun 2016 
22 Jun 2016
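
If you do want the error checking back, one possible sketch follows. It assumes every date is the `<i>` that is a direct child of `div.media-body`, as in the HTML from the question. `extract_dates` is a hypothetical helper name, and the 10-second timeout is an arbitrary choice. It uses the built-in `html.parser`, so lxml is not needed.

```python
import requests
from bs4 import BeautifulSoup


def extract_dates(html):
    """Return the text of each <i> that is a direct child of div.media-body."""
    soup = BeautifulSoup(html, 'html.parser')
    # Assumption: dates are marked up as in the question's HTML snippet.
    return [tag.get_text(strip=True) for tag in soup.select('div.media-body > i')]


def events_log(max_pages):
    """Collect the dates from pages 1..max_pages, skipping pages that fail."""
    dates = []
    # range() advances the page number, so the counter cannot be forgotten.
    for page in range(1, max_pages + 1):
        url = 'http://www.sharesansar.com/events/2016/06/page/' + str(page) + '/'
        try:
            res = requests.get(url, timeout=10)
            res.raise_for_status()  # turn 4xx/5xx responses into exceptions
        except requests.RequestException as e:
            print(e)
            continue
        dates.extend(extract_dates(res.text))
    return dates


if __name__ == '__main__':
    # Offline check against a shortened copy of the question's HTML.
    sample = ('<div class="media-body"><i class="">29 May 2016 </i>'
              '<a href="#"><h4 class="media-heading">Title</h4></a>'
              '<p>Details</p></div>')
    print(extract_dates(sample))
```

Keeping the parsing in its own function means you can test the selector on a saved HTML string without hitting the site, and `events_log` returns a list instead of printing so the dates can be reused.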

Retrieve a date with no class and id using findAll
