Scraping data from a lazy-loading page

Date: 2020-04-08 09:57:44

Tags: python json web-scraping beautifulsoup python-requests

I am trying to scrape data from this webpage, and I can successfully extract the data I need.
The problem is that the page downloaded with requests contains only 45 product details, while the site actually lists more than 4,000 products; the remaining products only load as you scroll down the page, so they are not available in the initial response. I want to scrape all the products available on that page.

Code

import requests
from bs4 import BeautifulSoup
import json
import re

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0'}

base_url = "link that i provided"
r = requests.get(base_url,headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

# The product data is embedded as JSON in one of the <script> tags:
# strip everything up to the first '=' and the trailing ';' to get valid JSON.
scripts = soup.find_all('script')[11].text
script = scripts.split('=', 1)[1]
script = script.rstrip()
script = script[:-1]

data = json.loads(script)

skus = list(data['grid']['entities'].keys())

# Build full product-page URLs from the SKU entries.
prodpage = []
for sku in skus:
    prodpage.append('https://www.ajio.com{}'.format(data['grid']['entities'][sku]['url']))

print(len(prodpage))

1 Answer:

Answer 0: (score: 3)

The fact that you have to scroll down means the data is generated by JavaScript, so you have a couple of options here. The first is to use Selenium (a minimal sketch is shown after the Ajax example below); the second is to send the same Ajax request that the website itself uses, like this:

import requests

def get_source(page_num=1):
    # The paginated JSON endpoint the site calls as you scroll.
    url = 'https://www.ajio.com/api/category/830216001?fields=SITE&currentPage={}&pageSize=45&format=json&query=%3Arelevance%3Abrickpattern%3AWashed&sortBy=relevance&gridColumns=3&facets=brickpattern%3AWashed&advfilter=true'

    res = requests.get(url.format(page_num), headers={'User-Agent': 'Mozilla/5.0'})
    if res.status_code == 200:
        return res.json()

# data = get_source(page_num = 1)
# total_pages = data['pagination']['totalPages'] # total pages are 111
prodpage = []
for i in range(1, 112):
    print(f'Getting page {i}')
    data = get_source(page_num=i)['products']
    for item in data:
        prodpage.append('https://www.ajio.com{}'.format(item['url']))
    if i == 3: break
print(len(prodpage)) # output 135 for 3 pages
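
To fetch all 111 pages instead of just the first three, remove the `if i == 3: break` line and drive the loop with the commented-out total_pages value, e.g. range(1, total_pages + 1).

For completeness, here is a minimal sketch of the first option mentioned above (Selenium). It assumes chromedriver is installed and on PATH and reuses base_url from the question; the scroll loop and the 2-second wait are illustrative choices, not part of the original answer.

from selenium import webdriver
from bs4 import BeautifulSoup
import time

driver = webdriver.Chrome()   # assumes chromedriver is available on PATH
driver.get(base_url)          # the category page URL from the question

# Keep scrolling to the bottom until the page height stops growing,
# i.e. no more products are being lazy-loaded.
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)             # give the new products time to render
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height

# The fully rendered page can now be parsed with BeautifulSoup as before.
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

In practice the Ajax approach shown above is usually faster and lighter, since it avoids launching a browser.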