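I am trying to write a scraper to get the results from the following page:
I would like to get all of the results, not just the "A" results, but I figured I could start with one letter and then run through the whole alphabet. If anyone can help with that part too, that would be great.
Anyway, I want to zero in on all of the party names, i.e. the elements with the class attribute party-name.
I have the following code: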
from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253da&page=1")
# Name the parser explicitly so BeautifulSoup does not have to guess one.
bsObj = BeautifulSoup(html, "html.parser")
# Every party name sits in a <td class="party-name"> cell.
nameList = bsObj.find_all("td", {"class": "party-name"})
for name in nameList:
    print(name.get_text())
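However, this only works for one page. The results span multiple pages. How can I do this across multiple pages?
Also, if you could help with getting all of the results, not just the A's, that would be great.
Edit: I have now improved my code so that it runs through all of the searches. However, I still can't move on to the next page. I tried using page_number++, but because the number of result pages differs from search to search, I don't know where to stop. How can I keep going to the next page and break at the last page?
New code: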
from urllib.request import urlopen
from bs4 import BeautifulSoup
import string

# Same values as before: the letters a-z followed by the digits 0-9.
all_letters = list(string.ascii_lowercase + string.digits)

for letter in all_letters:
    # page_number is never advanced, so only page 1 of each search is fetched.
    page_number = 1
    url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d" + letter + "&page=" + str(page_number)
    html = urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    nameList = bsObj.find_all("td", {"class": "party-name"})
    for name in nameList:
        print(name.get_text())
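Answer 0 (score: 0)
As I understand it, you want to change the "starts_with" parameter on the page and iterate over the whole alphabet. If I have understood the question correctly, this may help.
If you analyze the URL, you will find the answer.
The letter after the last "%253d" determines the "starts_with" term. It is currently 'a', so the search returns names starting with 'a'. To iterate, you only need to change that part of the URL: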
url = 'https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d' + starts_with + '&page=1'
Here starts_with can be a single character (a, b, c, ...) or a string (abc, asde, ...).
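The URL pattern described in this answer could be wrapped in a small helper, roughly as sketched below. The names build_search_url and SEARCH_PREFIX are made up for illustration and do not appear in the original posts. The prefix is copied verbatim from the question's URL, and the sketch assumes the search term consists only of ASCII letters and digits, so no extra escaping is attempted. Decoding the q value twice also suggests why the term follows the last "%253d": the value appears to be percent-encoded twice.

```python
from urllib.parse import unquote

# Prefix copied verbatim from the URL in the question.
SEARCH_PREFIX = (
    "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx"
    "?q=nco1%253d2%2526name1%253d"
)


def build_search_url(starts_with, page=1):
    """Return the search URL for names beginning with starts_with.

    Hypothetical helper: it assumes starts_with contains only ASCII
    letters and digits, which is all the question iterates over.
    """
    if not (starts_with.isascii() and starts_with.isalnum()):
        raise ValueError("starts_with must be ASCII letters or digits")
    return SEARCH_PREFIX + starts_with + "&page=" + str(page)


# Decoding the q value from the question twice reveals the underlying query.
decoded_query = unquote(unquote("nco1%253d2%2526name1%253da"))

print(build_search_url("a"))
print(build_search_url("abc", page=3))
print(decoded_query)
```

Calling build_search_url("b", page=2) produces the same string as the question's concatenation with letter = "b" and page_number = 2, so it could replace that line directly.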
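Answer 1 (score: -1)
I would solve it like this (a sketch that assumes the last page of each search has no a-element with id "NextLink1"):
from urllib.request import urlopen
from bs4 import BeautifulSoup
import string

base_url = "https://www.co.dutchess.ny.us/CountyClerkDocumentSearch/Search.aspx?q=nco1%253d2%2526name1%253d"
for letter in string.ascii_lowercase + string.digits:
    page = 1
    while True:
        url = base_url + letter + "&page=" + str(page)
        bsObj = BeautifulSoup(urlopen(url), "html.parser")
        # scrape the page
        for name in bsObj.find_all("td", {"class": "party-name"}):
            print(name.get_text())
        # check with bs if there is an a-element with id "NextLink1"
        if bsObj.find("a", id="NextLink1") is None:
            break
        page += 1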