I'm generally unfamiliar with coding, but for my first project I'm trying to build a monitor that watches a Shopify site for product changes.
My approach has been to take publicly shared code from online and work backwards from there to understand it. I ended up with the following code (it sits inside a larger class), which seems to fetch products.json by looping through the pages.
But when I load https://www.hanon-shop.com/collections/all/products.json and then print my list of items below it, the first few products are different. Why is that?
def scrape_site(self):
    """
    Scrapes the specified Shopify site and adds items to array
    :return: None
    """
    self.items = []
    s = rq.Session()
    page = 1
    while page > 0:
        try:
            html = s.get(self.url + '?page=' + str(page) + '&limit=250', headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
            output = json.loads(html.text)['products']
            if output == []:
                page = 0
            else:
                for product in output:
                    product_item = [{'title': product['title'], 'image': product['images'][0]['src'], 'handle': product['handle'], 'variants': product['variants']}]
                    self.items.append(product_item)
                logging.info(msg='Successfully scraped site')
                page += 1
        except Exception as e:
            logging.error(e)
            page = 0
        time.sleep(0.5)
    s.close()
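To check what order the endpoint actually returns (so I can compare it against what my scraper prints), a small helper like this works offline on a parsed payload. The function name and the sample data are just illustrative, not from the real site:

```python
import json

def first_titles(payload, n=3):
    """Return the first n product titles from a parsed products.json payload."""
    return [p['title'] for p in payload.get('products', [])[:n]]

# Hypothetical sample shaped like a (heavily truncated) products.json response.
sample = json.loads('{"products": [{"title": "A"}, {"title": "B"}, {"title": "C"}, {"title": "D"}]}')
print(first_titles(sample))  # ['A', 'B', 'C']
```

Running the same check on the live JSON versus the storefront page makes any ordering difference easy to see.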
Answer (score: 0):
Requests accepts a dict for the query parameters and also has a `json` method on the response, so this can be written more cleanly:
import time
import logging

import requests

def scrape_site(self):
    self.items = []
    page = 1
    with requests.Session() as s:
        while True:
            params = {
                'page': page,
                'limit': 250
            }
            try:
                r = s.get(self.url, params=params, headers=self.headers, proxies=self.proxy, verify=False, timeout=20)
                r.raise_for_status()
                output = r.json()
                # The payload is a dict like {"products": [...]}; it is never
                # falsy itself, so check the product list to end the loop.
                if not output['products']:
                    break
                for product in output['products']:
                    product_item = {
                        'title': product['title'],
                        'image': product['images'][0]['src'],
                        'handle': product['handle'],
                        'variants': product['variants']
                    }
                    self.items.append(product_item)
                logging.info(f'Successfully scraped page {page}')
                page += 1
                time.sleep(1)
            except Exception as e:
                logging.error(e)
                break
    return self.items
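To see what passing `params` as a dict does, here is a minimal offline sketch using `requests.Request(...).prepare()`, which builds the final URL without making any network call (the example.com URL is just a placeholder):

```python
from requests import Request

# Prepare a GET request with a params dict; requests URL-encodes it for us.
req = Request('GET', 'https://example.com/products.json',
              params={'page': 1, 'limit': 250}).prepare()
print(req.url)  # https://example.com/products.json?page=1&limit=250
```

This is exactly the query string the original code built by hand with string concatenation, but requests also handles encoding of special characters for you.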