How to iterate over multiple result pages when web scraping with Beautiful Soup

Time: 2016-07-16 04:16:19

Tags: python scripting web-scraping beautifulsoup

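I have written a script that uses Beautiful Soup to scrape a website for search results. I have managed to isolate the data I want by its class name.

However, the search results are not on a single page. They are spread over multiple pages, and I want all of them. I want my script to check whether there is a next page of results and run itself there as well. Since the number of results varies, I do not know how many pages exist, so I cannot predefine a range to iterate over. I also tried an 'if_page_exists' check. However, if I request a page number beyond the range of results, the page still exists; it simply has no results and shows a message saying there are no results to display.

What I did notice is that every results page has a 'Next' link with the id 'NextLink1', and the last results page does not. So I think that might be the trick. But I do not know how or where to implement that check. I keep ending up with infinite loops and the like.

The script below fetches results for the search term 'x'. Any help is much appreciated.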

from urllib.request import urlopen
from bs4 import BeautifulSoup

#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters= ['x']
for letter in all_letters:

    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    nameList = bsObj.find_all("td", {"class":"party-name"})

    for name in nameList:
        print(name.get_text())

Also, does anyone know a better way to instantiate a list of alphanumeric characters than the one I commented out in the script above?
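On this side question, one option is the standard-library `string` module, which provides the constants `ascii_lowercase` and `digits`. A minimal sketch that builds the same 36-character list as the commented-out line, reusing the name `all_letters` from the script:

```python
import string

# Lowercase letters a-z followed by digits 0-9, in the same order as the
# hand-written list in the script.
all_letters = list(string.ascii_lowercase + string.digits)

print(len(all_letters))  # 36
```

Concatenating the two constants and passing the result to `list()` keeps the order explicit, so the loop over `all_letters` would not need to change.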

1 Answer:

Answer 0: (score: 1)

Try this:

from urllib.request import urlopen
from bs4 import BeautifulSoup


#all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o","p","q","r","s","t","u","v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
all_letters = ['x']

def get_url(letter, page_number):
    return "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)

def list_names(soup):
    # Print the text of every table cell that carries the party-name class.
    nameList = soup.find_all("td", {"class": "party-name"})
    for name in nameList:
        print(name.get_text())

def get_soup(letter, page):
    url = get_url(letter, page)
    html = urlopen(url)
    # Name the parser explicitly so bs4 does not have to guess one.
    return BeautifulSoup(html, "html.parser")

def main():
    for letter in all_letters:
        bsObj = get_soup(letter, 1)

        # Reset the list for each letter so that page numbers from one
        # search do not carry over into the next.
        pages = []
        sel = bsObj.find('select', {"name": "ctl00$ctl00$InternetApplication_Body$WebApplication_Body$SearchResultPageList1"})
        # Assumption: a search with a single page of results may have no
        # page drop-down at all, in which case find() returns None.
        if sel is not None:
            # Keep every option except the one marked selected (page 1).
            for opt in sel.find_all("option", selected=lambda x: x != "selected"):
                pages.append(opt.string)

        list_names(bsObj)

        for page in pages:
            bsObj = get_soup(letter, page)
            list_names(bsObj)

main()

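In the main() function, we start from the first page, get_soup(letter, 1), and look up the page-list drop-down. Its option values hold the page numbers, and we store every one except the currently selected page in a list.

Next, we loop over those page numbers to extract the data from the remaining pages.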
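The 'Next' link mentioned in the question can also drive the loop directly, without reading the drop-down. The following is a minimal sketch, not tested against the live site. It assumes, as the question reports, that every results page except the last contains an element with the id 'NextLink1'. The names `BASE_URL`, `fetch`, `scrape_all_pages` and `max_pages` are hypothetical and introduced here for illustration; `max_pages` is only a safety cap against the infinite loops described in the question.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

BASE_URL = ("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx"
            "?q=nco1%253d2%2526name1%253d")

def fetch(letter, page_number):
    # Download one results page, using the same URL scheme as the scripts above.
    return urlopen(BASE_URL + letter + "&page=" + str(page_number)).read()

def scrape_all_pages(letter, fetch=fetch, max_pages=1000):
    """Collect party names page by page until no 'Next' link is found."""
    names = []
    page_number = 1
    while page_number <= max_pages:  # hypothetical cap against endless loops
        soup = BeautifulSoup(fetch(letter, page_number), "html.parser")
        for cell in soup.find_all("td", {"class": "party-name"}):
            names.append(cell.get_text())
        # Per the question, the last results page has no element with this id.
        if soup.find(id="NextLink1") is None:
            break
        page_number += 1
    return names
```

Calling `scrape_all_pages("x")` would return the names for the search term 'x' as a list. The exit check sits after the names are collected, so the last page is still processed before the loop stops, and passing `fetch` as a parameter lets the loop be tried on saved HTML before it is pointed at the real site.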