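I would like to know how to use Beautiful Soup to scrape several different pages for one city (e.g. London) from a website without having to repeat my code over and over.
Ideally, my goal is to first scrape all pages related to one city.
My code is below: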
session = requests.Session()
session.cookies.get_dict()
url = 'http://www.citydis.com'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = session.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")
jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=0"
response = session.get(jsonUrl, headers=headers)
js_dict = (json.loads(response.content.decode('utf-8')))
for item in js_dict:
    headers = js_dict['searchResults']["tours"]
    prices = js_dict['searchResults']["tours"]
for title, price in zip(headers, prices):
    title_final = title.get("title")
    price_final = price.get("price")["original"]
    print("Header: " + title_final + " | " + "Price: " + price_final)
The output is as follows:
Header: London Travelcard: 1 Tag lang unbegrenzt reisen | Price: 19,44 €
Header: 105 Minuten London bei Nacht im verdecklosen Bus | Price: 21,21 €
Header: Ivory House London: 4 Stunden mittelalterliches Bankett | Price: 58,92 €
Header: London: Themse Dinner Cruise | Price: 96,62 €
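It only returns the results from the first page (4 results), but I want to get all results for London (there must be more than 200).
Can you give me any advice? I think I have to count up the page number in the JSON URL, but I don't know how to do that.
Update
Thanks to your help, I was able to get a step further.
In this case, I can only scrape one page (page = 0), but I want to scrape the first 10 pages. So my approach is as follows:
The relevant snippet from the code: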
soup = bs4.BeautifulSoup(response.content, "html.parser")
metaConfig = soup.find("meta", property="configuration")
page = 0
while page <= 11:
    page += 1
    jsonUrl = "https://www.citydis.com/s/results.json?&q=Paris&customerSearch=1&page=" + str(page)
    response = session.get(jsonUrl, headers=headers)
    js_dict = (json.loads(response.content.decode('utf-8')))
    for item in js_dict:
        headers = js_dict['searchResults']["tours"]
        prices = js_dict['searchResults']["tours"]
    for title, price in zip(headers, prices):
        title_final = title.get("title")
        price_final = price.get("price")["original"]
        print("Header: " + title_final + " | " + "Price: " + price_final)
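I get the results for one specific page, but not for all of them. On top of that, I get an error message. Is that related to why I am not getting all the results?
Output: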
Traceback (most recent call last):
File "C:/Users/Scripts/new.py", line 19, in <module>
AttributeError: 'list' object has no attribute 'update'
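Thanks for your help.
Answer 0 (score: 1)
You really should make sure your code samples are complete (yours is missing its imports) and syntactically correct (your code has indentation problems). While trying to build a working example, I came up with the following.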
import json
import bs4
import requests

session = requests.Session()
url = 'http://www.getyourguide.de'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
# Load the landing page first to obtain cookies and the page token.
response = session.get(url, headers=headers)
soup = bs4.BeautifulSoup(response.content, "html.parser")
# The token is stored as JSON in the content of <meta property="configuration">.
metaConfig = soup.find("meta", property="configuration")
metaConfigTxt = metaConfig["content"]
csrf = json.loads(metaConfigTxt)["pageToken"]
jsonUrl = "https://www.getyourguide.de/s/results.json?&q=London&customerSearch=1&page=0"
headers.update({'X-Csrf-Token': csrf})
response = session.get(jsonUrl, headers=headers)
js_dict = json.loads(response.content.decode('utf-8'))
print(js_dict.keys())
# Give the tour list its own name so the headers dict is not overwritten.
tours = js_dict['searchResults']["tours"]
for tour in tours:
    title_final = tour.get("title")
    price_final = tour.get("price")["original"]
    print("Header: " + title_final + " | " + "Price: " + price_final)
This gives me more than four results.
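For the loop over the first ten pages described in the question, one possible sketch follows. It assumes that the endpoint accepts `page` values 0 through 9 and keeps returning the `searchResults` / `tours` structure; neither is confirmed here. The names `collect_tours`, `fetch_page`, and `fake_fetch` are hypothetical, and `fake_fetch` only stands in for the real request so that the sketch runs offline. The tour list gets its own name, because rebinding `headers` to a list inside the loop is the likely cause of the `AttributeError` in the question once `headers.update(...)` is called again.

```python
from typing import Callable, Dict, List


def collect_tours(fetch_page: Callable[[int], Dict], max_pages: int = 10) -> List[Dict]:
    """Collect tours from pages 0 .. max_pages - 1.

    fetch_page is a hypothetical callable that returns the decoded JSON
    for one page number.
    """
    all_tours: List[Dict] = []
    for page in range(max_pages):
        js_dict = fetch_page(page)
        # Separate name: the request headers dict is never overwritten.
        tours = js_dict.get("searchResults", {}).get("tours", [])
        if not tours:
            # Assumption: an empty page means there are no more results.
            break
        all_tours.extend(tours)
    return all_tours


def fake_fetch(page: int) -> Dict:
    """Stand-in for the real request: three pages of four tours each."""
    if page >= 3:
        return {"searchResults": {"tours": []}}
    return {"searchResults": {"tours": [
        {"title": f"Tour {page}-{i}", "price": {"original": "10,00 €"}}
        for i in range(4)
    ]}}


tours = collect_tours(fake_fetch)
for tour in tours:
    print("Header: " + tour["title"] + " | Price: " + tour["price"]["original"])
```

With a real session, `fetch_page` could be something like `lambda page: session.get(base_url + str(page), headers=headers).json()`, where `base_url` is a hypothetical name for the results URL without the page number.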
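In general, you will find that many sites returning JSON paginate their responses, serving a fixed number of results per page. In those cases, every page except the last usually contains a key whose value is the URL of the next page. Looping over the pages is then straightforward: when you detect that the key is missing, break out of the loop.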
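A minimal sketch of that pattern is shown below. The key name `"next"`, the helper `iter_pages`, and the `FAKE_PAGES` data are all assumptions made for illustration; the site from the question is not confirmed to expose such a key, so check the actual JSON first.

```python
from typing import Callable, Dict, Iterator


def iter_pages(first_url: str, fetch: Callable[[str], Dict], next_key: str = "next") -> Iterator[Dict]:
    """Yield each page, following the next-page URL until it is missing."""
    url = first_url
    while True:
        page = fetch(url)
        yield page
        if next_key not in page:
            # Last page: there is no link left to follow.
            break
        url = page[next_key]


# Hypothetical paginated responses, keyed by URL, so the sketch runs offline.
FAKE_PAGES = {
    "page/0": {"items": [1, 2], "next": "page/1"},
    "page/1": {"items": [3, 4], "next": "page/2"},
    "page/2": {"items": [5]},
}

items = []
for page in iter_pages("page/0", FAKE_PAGES.__getitem__):
    items.extend(page["items"])
print(items)
```

Passing the fetch function in as an argument keeps the loop independent of how a page is actually downloaded, so a wrapper around `session.get` can replace the dictionary lookup.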