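I am trying to scrape data from this web page, and I can successfully scrape the data I need.
The problem is that the page downloaded with requests contains only 45 product details, while the actual page lists more than 4,000 products. This happens because not all of the data is available directly; more products appear only as you scroll down the page.
I want to scrape all the products available on that page.
Code: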
import requests
from bs4 import BeautifulSoup
import json
import re
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}
base_url = "link that i provided"
r = requests.get(base_url,headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
scripts = soup.find_all('script')[11].text
script = scripts.split('=', 1)[1]
script = script.rstrip()
script = script[:-1]
data = json.loads(script)
skus = list(data['grid']['entities'].keys())
prodpage = []
for sku in skus:
    prodpage.append('https://www.ajio.com{}'.format(data['grid']['entities'][sku]['url']))
print(len(prodpage))
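Answer 0 (score: 3)
Content that loads as you scroll down is generated by JavaScript, so you have a couple of options here. The first is to use Selenium. The second is to send the same Ajax request that the website itself uses, as follows: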
import requests

def get_source(page_num=1):
    # Same Ajax endpoint the site calls while scrolling; currentPage selects the page.
    url = 'https://www.ajio.com/api/category/830216001?fields=SITE&currentPage={}&pageSize=45&format=json&query=%3Arelevance%3Abrickpattern%3AWashed&sortBy=relevance&gridColumns=3&facets=brickpattern%3AWashed&advfilter=true'
    res = requests.get(url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    if res.status_code == 200:
        return res.json()

# data = get_source(page_num=1)
# total_pages = data['pagination']['totalPages']  # total pages are 111
prodpage = []
for i in range(1, 112):
    print(f'Getting page {i}')
    data = get_source(page_num=i)['products']
    for item in data:
        prodpage.append('https://www.ajio.com{}'.format(item['url']))
    if i == 3: break  # stop after 3 pages for this demo
print(len(prodpage))  # output 135 for 3 pages
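Rather than hard-coding the page count, the loop can read it from the first response. The following is only a minimal sketch: it assumes the JSON has the `pagination.totalPages` and `products[].url` fields used above and that pages are numbered from 1. The names `collect_product_urls`, `fetch_page`, and `max_pages` are hypothetical; `fetch_page` stands in for a function like `get_source` and is passed in so the loop can be exercised without hitting the network.

```python
from typing import Callable, Dict, List, Optional

BASE_URL = 'https://www.ajio.com'


def collect_product_urls(
    fetch_page: Callable[[int], Optional[Dict]],
    max_pages: Optional[int] = None,
) -> List[str]:
    """Collect product URLs from every page reported by the first response.

    fetch_page(n) is assumed to return the parsed JSON for page n,
    or None when the request fails.
    """
    first = fetch_page(1)
    if not first:
        return []

    # Assumed field: the API reports how many pages exist.
    total = first['pagination']['totalPages']
    if max_pages is not None:
        total = min(total, max_pages)

    urls = [BASE_URL + item['url'] for item in first['products']]
    for page in range(2, total + 1):
        data = fetch_page(page)
        if not data:
            break  # stop on the first failed page instead of crashing
        urls.extend(BASE_URL + item['url'] for item in data['products'])
    return urls
```

With the answer's function this would be called as `collect_product_urls(get_source)`, or `collect_product_urls(get_source, max_pages=3)` for a short trial run. Adding a small delay between requests may be sensible for a full crawl.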