感谢这里有很多时间在这里被问过,但我似乎无法让它为我工作。
我已经写了一个刮刀,它成功地从网站的第一页抓取了我需要的一切。但是,我无法弄清楚如何让它循环遍历各个页面。
网址只是增加,如BLAH / 3 +' page = x'
我还没有学习很长时间的代码,所以任何建议都会受到赞赏!
import requests
from bs4 import BeautifulSoup
url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3'
soup = BeautifulSoup(r.content, "html.parser")
# String substitution for HTML
for link in soup.find_all("a"):
"<a href='>%s'>%s</a>" %(link.get("href"), link.text)
# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})
for item in general_data:
name = print(item.contents[0].text)
address = print(item.contents[1].text.replace('.',''))
care_type = print(item.contents[2].text)
更新
r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3')
for page in range(10):
r = requests.get('http://www.URL.org/BLAH1/BLAH2/BLAH3' + 'page=' + page)
soup = BeautifulSoup(r.content, "html.parser")
#print(soup.prettify())
# String substitution for HTML
for link in soup.find_all("a"):
"<a href='>%s'>%s</a>" %(link.get("href"), link.text)
# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})
for item in general_data:
name = print(item.contents[0].text)
address = print(item.contents[1].text.replace('.',''))
care_type = print(item.contents[2].text)
更新2!:
import requests
from bs4 import BeautifulSoup
url = 'http://www.URL.org/BLAH1/BLAH2/BLAH3&page='
for page in range(10):
r = requests.get(url + str(page))
soup = BeautifulSoup(r.content, "html.parser")
# String substitution for HTML
for link in soup.find_all("a"):
print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))
# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})
for item in general_data:
print(item.contents[0].text)
print(item.contents[1].text.replace('.',''))
print(item.contents[2].text)
答案 0 :(得分:1)
要使用page=x
来循环网页,您需要for
这样的循环&gt;
import requests
from bs4 import BeautifulSoup
url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10&page='
for page in range(10):
print('---', page, '---')
r = requests.get(url + str(page))
soup = BeautifulSoup(r.content, "html.parser")
# String substitution for HTML
for link in soup.find_all("a"):
print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))
# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})
for item in general_data:
print(item.contents[0].text)
print(item.contents[1].text.replace('.',''))
print(item.contents[2].text)
每个页面都可以不同,更好的解决方案需要更多关于页面的信息。有时您可以链接到最后一页,然后您可以在10
range(10)
如果没有到下一页的链接,您可以使用while True
循环并break
离开循环。但首先你需要显示这个页面(网址到真实页面)。
编辑:示例如何获取下一页的链接,然后您获得所有页面 - 不仅仅是以前版本中的10页。
import requests
from bs4 import BeautifulSoup
# link to first page - without `page=`
url = 'http://www.housingcare.org/housing-care/results.aspx?ath=1%2c2%2c3%2c6%2c7&stp=1&sm=3&vm=list&rp=10'
# only for information, not used in url
page = 0
while True:
print('---', page, '---')
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
# String substitution for HTML
for link in soup.find_all("a"):
print("<a href='>%s'>%s</a>" % (link.get("href"), link.text))
# Fetch and print general data from title class
general_data = soup.find_all('div', {'class' : 'title'})
for item in general_data:
print(item.contents[0].text)
print(item.contents[1].text.replace('.',''))
print(item.contents[2].text)
# link to next page
next_page = soup.find('a', {'class': 'next'})
if next_page:
url = next_page.get('href')
page += 1
else:
break # exit `while True`