How do I make a web scraper with Beautiful Soup iterate over multiple pages of search results?

Asked: 2016-07-14 02:40:10

Tags: python scripting web-scraping beautifulsoup

I'm trying to write a scraper to get the results from the following page:

https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253da&page=1

I'm trying to get all the results, not just the "a" results, but I figure I can start with one letter and then work through the whole alphabet. If anyone can help with that part too, that would be great.

In any case, I want to zero in on all the party names, i.e. the elements with the class attribute party-name.

I have the following code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253da&page=1")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.find_all("td", {"class": "party-name"})
for name in nameList:
    print(name.get_text())

However, this only works for one page, and the results span multiple pages. How do I do this across multiple pages?

Also, if you could help me get all the results, not just the "a" ones, that would be fantastic.

EDIT: I've now improved my code so it runs through all the searches. However, I still can't get to the next page. I've tried incrementing page_number, but since the number of result pages varies per search, I don't know where to stop. How can I break out once I've passed the last page?

New code:

from urllib.request import urlopen
from bs4 import BeautifulSoup

all_letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"]
for letter in all_letters:

    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    nameList = bsObj.find_all("td", {"class": "party-name"})

    for name in nameList:
        print(name.get_text())

2 answers:

Answer 0 (score: 0)

As I understand it, you want to change the "starts_with" parameter in the URL and iterate over the whole alphabet. If my reading of the question is correct, this may help.

If you analyze the URL, you'll find the answer.

url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253da&page=1"

The letter after the last "%253d" determines the "starts_with" term. It is currently 'a', so the search returns results starting with 'a'. To iterate, you only need to change that part of the URL:

url = 'https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d' + starts_with + '&page=1'

Here starts_with can be a single character (a, b, c, ...) or a string (abc, asde, ...).
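Putting that together, here is a minimal sketch that builds the first-page URL for every starts_with value without typing the list out by hand (it only constructs the URLs; fetching and parsing each one is left out):

```python
import string

# The starts_with values the question lists by hand: a-z, then 0-9.
all_letters = list(string.ascii_lowercase) + list(string.digits)

base = ("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/"
        "Search.aspx?q=nco1%253d2%2526name1%253d")

# Build the first-page URL for each starts_with value.
urls = [base + starts_with + "&page=1" for starts_with in all_letters]
```

Each of these URLs can then be fetched and parsed exactly as in the question's loop.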

Answer 1 (score: -1)

I would solve it like this (pseudocode):

for letter in all_letters:
    page = 1
    while True:
        url = base_url + letter + "&page=" + str(page)
        # scrape the page
        # check with bs if there is an <a> element with id "NextLink1"
        if not link_to_next_page_found:
            break
        page += 1
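A runnable sketch of that loop, assuming (as the answer does, unverified) that each results page exposes a next-page anchor with id "NextLink1":

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup


def has_next_page(soup):
    """True if the parsed results page contains the next-page link.

    The id "NextLink1" is taken from the answer above; verify it
    against the site's actual markup before relying on it.
    """
    return soup.find("a", id="NextLink1") is not None


def scrape_letter(letter):
    """Collect party names for one starts_with value across all pages."""
    names = []
    page = 1
    while True:
        url = ("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/"
               "Search.aspx?q=nco1%253d2%2526name1%253d" + letter
               + "&page=" + str(page))
        soup = BeautifulSoup(urlopen(url), "html.parser")
        names.extend(td.get_text()
                     for td in soup.find_all("td", {"class": "party-name"}))
        # Stop once the current page has no link to a next page.
        if not has_next_page(soup):
            break
        page += 1
    return names
```

This avoids guessing the page count: the loop simply stops when the last fetched page no longer links forward.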