Web scraping a site with a page limit

Date: 2020-08-12 17:49:05

Tags: python web-scraping beautifulsoup python-requests data-science

I've been trying to scrape the products on this site (https://www.americanas.com.br/hotsite/todas-ofertas-mundo) with BeautifulSoup. I can get all the items on a single page, and since the pagination is in the URL I just use a counter to move to the next page (e.g. page 2 is https://www.americanas.com.br/hotsite/todas-ofertas-mundo/pagina-2, and so on). The problem is that after page 416 no products are shown at all, so 416 is the maximum page number. Since each page shows 24 products, I can barely reach 10k products (according to the page, the total is 4 million).
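The arithmetic behind that cap, for reference:

```python
# Each page renders 24 products and pages stop rendering after page 416,
# so plain URL pagination can expose at most:
pages, per_page = 416, 24
print(pages * per_page)  # 9984 products reachable, out of ~4 million
```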

I tried going deeper into the categories, but ran into the same problem (some of the deeper categories also have more than 10k products). I also filtered by "marca", "price" and "loja", with the same issue. So even with the best filters I can't get all the products, because I still hit the maximum page within each one.

I also searched for an API so I could try to bypass this, but I couldn't find anything that would let me request the catalogue without already having a product ID. I did find one for getting the different brands and sellers, but it has the same product-count problem.

This is also a problem on other marketplaces I scrape, which makes it hard to get a site's full catalogue: I have to filter and try to retrieve the maximum number of products, but never all of them. Any suggestions are welcome. Thanks!

Here is the code that scrapes the pages:

import requests
import re
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings()

def html(url):
    # Fetch a page and return the parsed soup (certificate checks disabled)
    try:
        soup = BeautifulSoup(requests.get(url, verify=False).content, 'html.parser', from_encoding="utf-8")
        return soup
    except Exception as e:
        print(e)
        print("Not loading")

def product_info(prod):
    quote = {}
    # Skip products marked as unavailable
    if prod.find("span", {"class": re.compile(r"UnavailableTextMessage")}):
        return None
    quote['id'] = prod.find("a").get("href").replace("?", "/", 1).split("/")[2]
    quote['name'] = prod.find("h2").getText()
    quote['price'] = prod.find("span", {"class": re.compile(r"PriceUI-bwhjk3-11")}).getText().replace(".", "").split(" ")[-1]
    quote['full_price'] = quote['price']
    quote['discount'] = ''
    discount = prod.find("span", {"class": re.compile(r"TextUI-xlll2j-3")})
    if discount:
        quote['discount'] = discount.getText().replace("%", "")
        quote['full_price'] = prod.find("span", {"class": re.compile(r"PriceUI-sc-1q8ynzz-0")}).getText().replace(".", "").split(" ")[-1]
    quote['inter'] = 1 if prod.find("span", {"class": re.compile(r"InternationalText")}) else 0
    quote['url'] = "https://americanas.com.br" + prod.find("a").get("href")
    return quote



test = "https://www.americanas.com.br/hotsite/todas-ofertas-mundo"
products = []  # capped around 10k by the 416-page limit
counter = 1  # start at page 1 so the first page isn't skipped
while True:
    url = test + "/pagina-" + str(counter)
    counter += 1
    soup = html(url)
    print(url)
    content = soup.findAll("div", {"class": "product-grid-item"})
    if not content:
        print(counter)
        break
    for cont in content:
        quote = product_info(cont)
        if quote:
            products.append(quote)

1 answer:

Answer 0 (score: 2)

You had the right idea looking for an API. If you log your network traffic while visiting one of the product pages, you'll see requests to several APIs.

The first one returns a collection of product IDs. Note the query-string parameters offset and limit. In this example, I've set offset to "0" (so we start at the first product) and limit to "10" to retrieve the product IDs of the first ten products:

def main():

    import requests

    url = "https://mystique-v2-americanas.juno.b2w.io/search"

    params = {
        "offset": "0",
        "sortBy": "topSelling",
        "source": "omega",
        "filter": [
            '{"id":"referer","value":"/hotsite/todas-ofertas-mundo","fixed":true,"hidden":true}',
            '{"id":"currency","value":"USD","fixed":true,"name":"moeda","hidden":true}'
        ],
        "limit": "10",
        "suggestion": "true"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    products = response.json()["products"]

    for product in products:
        print(product["id"])

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

158285472
107684121
88842655
88899155
84894032
94728488
107684117
84894015
80349294
84894042

Combine this with the other API, which lets you get product-specific information for a given product ID:

def get_product_info(product_id):

    import requests

    url = "https://restql-server-api-v2-americanas.b2w.io/run-query/catalogo/product-buybox/5"

    params = {
        "c_opn": "",
        "id": product_id,
        "offerLimit": "1",
        "opn": "",
        "tags": "prebf*|SUL_SUDESTE_CENTRO|livros_prevenda"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    info = response.json()

    return info["product"]["result"]["name"], info["installment"]["result"][0][0]["total"]

def main():

    import requests

    url = "https://mystique-v2-americanas.juno.b2w.io/search"

    params = {
        "offset": "0",
        "sortBy": "topSelling",
        "source": "omega",
        "filter": [
            '{"id":"referer","value":"/hotsite/todas-ofertas-mundo","fixed":true,"hidden":true}',
            '{"id":"currency","value":"USD","fixed":true,"name":"moeda","hidden":true}'
        ],
        "limit": "10",
        "suggestion": "true"
    }

    response = requests.get(url, params=params)
    response.raise_for_status()

    products = response.json()["products"]

    for product in products:
        name, price = get_product_info(product["id"])
        print(f"The name is \"{name}\" and the price is {price}.")

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

The name is "Smartwatch Esportivo Blitzwolf ® BW-HL1 ip68 e Multi Idiomas" and the price is 197.17.
The name is "Bebe reborn girafinha" and the price is 466.87.
The name is "Boneca Bebe Reborn 45 Cm corpo todo de Silicone Boneca Menina Reborn Realista bebes cabelo e olhos castanhos NPKDOLL" and the price is 400.28.
The name is "Boneca Bebê Reborn 43cm Corpo Todo Silicone - Menina com Cabelo Cacheado e Ursinho de pelúcia KAYDORA" and the price is 397.48.
The name is "Boneca Bebe Reborn Menina com roupa de Pandinha 47 cm NPKDOLL" and the price is 329.28.
The name is "Fones De Ouvido Sem Fio Bluetooth Xiaomi Redmi Airdots" and the price is 256.2.
The name is "Boneca Bebe Reborn Menino Girafinha 48 Cm Menino com Pelucia Girafa Azul NPKDOLL" and the price is 464.63.
The name is "Boneca Bebê Reborn Menina Realista de Silicone e Algodão 48cm e Girafinha NPKDOLL" and the price is 289.96.
The name is "Mini Caixa de Som Portátil Speaker  a Prova D’Água - Xiaomi" and the price is 165.2.
The name is "Boneca Bebe Reborn Menina princesa com casaco de inverno de coelhinho 45 cm NPKDOLL" and the price is 372.4.

You get the idea. I haven't actually tried setting the limit query-string parameter to anything other than ten, so you may want to experiment with it.
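If the server honours larger offsets, the same params dict can be advanced page by page to walk the catalogue. A minimal sketch of building the per-page parameters (the page size of 100, and the assumption that the API accepts arbitrary offsets, are untested):

```python
def search_params(offset, limit):
    """Build the query parameters for one page of the search API."""
    return {
        "offset": str(offset),
        "sortBy": "topSelling",
        "source": "omega",
        "filter": [
            '{"id":"referer","value":"/hotsite/todas-ofertas-mundo","fixed":true,"hidden":true}',
            '{"id":"currency","value":"USD","fixed":true,"name":"moeda","hidden":true}'
        ],
        "limit": str(limit),
        "suggestion": "true"
    }

# Walk the first few pages in steps of `limit`; each params dict would be
# passed to requests.get(url, params=params) as in the snippets above.
limit = 100
for page in range(3):
    params = search_params(page * limit, limit)
    print(params["offset"])  # 0, 100, 200
```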