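How do I get Irish prices in euros when web scraping?

Date: 2021-04-25 14:40:34

Tags: python web-scraping

When I run the Python code below, I get a list of prices in pounds (£) from the UK site. How can I get the list of prices from the Irish (IE) site instead? Cheers.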

import requests
import pandas as pd
from bs4 import BeautifulSoup


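price = []  # filled by asos() and used for the DataFrame below

def asos(soup_in):
    # Each price sits in an element with this site-generated class name.
    price_div = soup_in.find_all(class_='qU9n4CQ')

    for container in price_div:
        text = container.text
        price.append(text)  # collect the price so the DataFrame is not empty
        print(text)  # Displays pounds (£)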

url = "https://www.asos.com/men/t-shirts-vests/cat/?cid=7616&nlid=mw|clothing|shop+by+product|t-shirts+%26+vests"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:86.0) Gecko/20100101 Firefox/86.0"}
results = requests.get(url, headers=headers)
soup = BeautifulSoup(results.text, "html.parser")
asos(soup)


asos_t_shirt = pd.DataFrame({
    'Prices': price

})

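When I open the link https://www.asos.com/men/t-shirts-vests/cat/?cid=7616&nlid=mw|clothing|shop+by+product|t-shirts+%26+vests in my browser, it shows the Irish prices in euros, but when I run the code I get the values in pounds. Any help would be greatly appreciated.

Output: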

£10.00
£18.00
£18.00
£25.00
£22.00
£20.00
ETC...

Updated code

import requests

query = """query {
    data {
    product {
      name
    }
  }
}"""

url = 'https://www.zalando.ie/api/graphql/'
r = requests.post(url, json={'query': query})
print(r.status_code)
print(r.text)

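1 answer:

Answer 0 (score: 1)

I visited the page in my browser. At the bottom of the page there is a "LOAD MORE" button. I logged my browser's network traffic, pressed the button, and saw that the browser made several XHR HTTP GET requests. One of them goes to a REST API that returns JSON containing all the product information you could want, including the price. This is not unusual: many modern online shops are built this way. Product information is fetched from an API, and the DOM is then populated asynchronously with JavaScript. It is a bit odd, though, that the products on the first "page" (the ones visible as soon as you open the shop) are baked directly into the HTML rather than retrieved from the API. We can retrieve those products through the API as well: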

def get_products():
    import requests

    api_url = "https://www.asos.com/api/product/search/v2/categories/7616"

    params = {
        "channel": "desktop-web",
        "country": "GB",
        "currency": "GBP",
        "keyStoreDataversion": "hnm9sjt-28",
        "lang": "en-GB",
        "limit": "72",
        "offset": "0",
        "rowlength": "4",
        "store": "COM"
    }

    response = requests.get(api_url, params=params)
    response.raise_for_status()

    return response.json()["products"]
    

def main():

    products = get_products()

    print("Discovered {} product(s).".format(len(products)))

    for product in products:
        print("\"{}\" - ({})".format(product["name"], product["price"]["current"]["text"]))
    
    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Discovered 72 product(s).
"ASOS DESIGN oversized t-shirt with crew neck in navy" - (£10.00)
"ASOS DESIGN organic relaxed long sleeve t-shirt with colour block sleeves" - (£18.00)
"ASOS DESIGN long sleeve t-shirt with cut and sew panels in grey" - (£18.00)
"ASOS DESIGN 2 pack long sleeve sleeve waffle t-shirt" - (£25.00)
"ASOS DESIGN 5 pack t-shirt with crew neck" - (£26.00)
"Nike World Tour Pack graphic oversized t-shirt in black" - (£29.95)
"ASOS DESIGN organic muscle fit t-shirt with crew neck in black" - (£4.80)
"New Look t-shirt with crew neck in brown" - (£3.25)
"Vans Off The Wall Classic t-shirt in pink" - (£20.15)
"Original Penguin small logo t-shirt slim fit in black" - (£20.00)
"ASOS DESIGN knitted vest with floral design in khaki" - (£22.00)
"ASOS DESIGN oversized t-shirt in tie-dye organic cotton with smile chest print" - (£20.00)
...

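As written, this does not actually solve your currency problem; it is just a better way of getting the data than scraping it with BeautifulSoup or Selenium. You would think that getting prices in euros is as simple as changing the "currency" key-value pair in the params query-string dictionary, and it (almost) is that simple. Changing only "currency": "GBP" to "currency": "EUR" yields a 400 response, which means our request was not well-formed. It turns out that this API does not accept any discrepancy between the "currency", "country", "lang", and "store" key-value pairs. In other words, we have to change all four of these query-string parameters for the API to accept the request, and the four values have to make sense together.

For example, I changed the query-string parameters to the following, making it look as though we are shopping in the German store, so that we can get prices in euros: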
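
One way to avoid mixing inconsistent values is to keep each store's four locale settings together and merge them into the shared parameters as a unit. The following is only a minimal sketch: `BASE_PARAMS`, `STORE_LOCALES`, and `build_params` are hypothetical names introduced here, and the values are simply the ones that appear in this answer, not a documented list of what the API accepts.

```python
# Parameters that stay the same regardless of store (values taken from this answer).
BASE_PARAMS = {
    "channel": "desktop-web",
    "keyStoreDataversion": "hnm9sjt-28",
    "limit": "72",
    "offset": "0",
    "rowlength": "4",
}

# Hypothetical lookup table: the four values that must agree with each other.
STORE_LOCALES = {
    "GB": {"country": "GB", "currency": "GBP", "lang": "en-GB", "store": "COM"},
    "DE": {"country": "DE", "currency": "EUR", "lang": "de-DE", "store": "DE"},
}


def build_params(store_key):
    """Return a fresh params dict whose four locale values belong together."""
    if store_key not in STORE_LOCALES:
        raise KeyError("No locale settings known for store {!r}".format(store_key))
    params = dict(BASE_PARAMS)  # copy, so the shared defaults are never mutated
    params.update(STORE_LOCALES[store_key])
    return params


print(build_params("DE"))
```

The returned dictionary can be passed as `params=` to `requests.get` in place of the hand-written one, and further stores can be added to the table once a working combination has been observed in the browser's network traffic.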

params = {
    "channel": "desktop-web",
    "country": "DE",
    "currency": "EUR",
    "keyStoreDataversion": "hnm9sjt-28",
    "lang": "de-DE",
    "limit": "72",
    "offset": "0",
    "rowlength": "4",
    "store": "DE"
}

Running the script again with these changes applied gives the following output:

Discovered 72 product(s).
"ASOS DESIGN – Langärmliges Shirt mit Einsätzen im Patchwork-Design in Grau" - (22,99 €)
"ASOS DESIGN – 2er-Pack langärmlige Shirts mit Waffelstruktur" - (31,99 €)
"ASOS Daysocial – Oversized T-Shirt in Blau mit akzentuiertem Batikmuster" - (24,99 €)
"ASOS DESIGN – Strick-Trägershirt im Blumendesign in Khaki" - (27,99 €)
"ASOS DESIGN – Oversize-T-Shirt aus Bio-Baumwolle mit Batikmuster und Smiley-Print auf der Brust" - (25,99 €)
"ASOS Daysocial – Oversize-T-Shirt mit Blumen- und Logoprints auf der Vorder- und Rückseite in Grün, Kombiteil" - (22,99 €)
"ASOS Daysocial – Oversize-T-Shirt mit mehreren bunten Sonnen- und Logoprints in Blaugrün" - (24,99 €)
"ASOS Daysocial – Oversize-T-Shirt mit lila und blauem Batikmuster" - (22,99 €)
"Reclaimed Vintage – Inspired – Überfärbtes Oversize-T-Shirt in Anthrazit" - (28,99 €)
"Reclaimed Vintage – Inspired – Verwaschenes Oversized-T-Shirt mit Logo" - (28,99 €)
"ASOS DESIGN – Oversize-T-Shirt in gebrochenem Weiß mit „Paris”-Cityprint" - (14,99 €)
...

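As you can see, the prices are now in euros (good), but the product names have changed to their German versions (bad). I imagine a real solution would involve making two requests to the API: one with the English query-string parameters (to collect the product names) and one with the German parameters (to collect the prices in euros). It is also important to note that the products do not appear in the same order in the first (English) request as in the second (German) one. I think this can be solved by looking at the ID of each retrieved product and matching the names and prices from the two requests by product ID.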
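
The matching step described above could look roughly like this. It is a sketch under one loud assumption: that every product dictionary carries its identifier under the key "id" (the key name is not shown in this answer). `merge_by_id` is a hypothetical helper, and the sample data is made up so the example runs without a network connection.

```python
def merge_by_id(english_products, german_products):
    """Pair English names with German prices for products found in both lists."""
    # Assumption: each product dict stores its identifier under the key "id".
    german_by_id = {product["id"]: product for product in german_products}

    merged = []
    for product in english_products:
        match = german_by_id.get(product["id"])
        if match is None:
            continue  # product is not sold in the German store, so skip it
        merged.append({
            "id": product["id"],
            "name": product["name"],
            "price": match["price"]["current"]["text"],
        })
    return merged


# Made-up sample data shaped like the JSON used elsewhere in this answer.
english = [
    {"id": 1, "name": "Crew neck t-shirt", "price": {"current": {"text": "£10.00"}}},
    {"id": 2, "name": "Knitted vest", "price": {"current": {"text": "£22.00"}}},
]
german = [
    {"id": 2, "name": "Strick-Trägershirt", "price": {"current": {"text": "27,99 €"}}},
    {"id": 3, "name": "Oversize-T-Shirt", "price": {"current": {"text": "25,99 €"}}},
]

print(merge_by_id(english, german))
```

Building a dictionary keyed by ID first means the order in which the two responses list their products no longer matters.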


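EDIT - I just wrote a script to collect all English and German products. In total there are 8993 English products and 8755 German products. Between the two sets, 6552 share the same product ID. This means that the two stores differ not only in how many products they carry but also in which products they carry. So there appear to be 6552 products for which you could, in theory, get an English name together with the corresponding German price.


EDIT - Getting the Irish products makes a lot of sense, because it gives you English product names and price information in euros.

For each request we make to the API, we can ask for information on at most 200 products at a time (this limit appears to be imposed by the API). You will find the updated code below. get_products is now a generator that yields lists of up to 200 products at a time. In the main function we accumulate all of these lists into one large list. We use itertools.chain.from_iterable to flatten them, so that we end up with one large list of products rather than a list of lists: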

def get_products():
    import requests
    import itertools

    api_url = "https://www.asos.com/api/product/search/v2/categories/7616"

    limit = 200

    params = {
        "channel": "desktop-web",
        "country": "IE",
        "currency": "EUR",
        "keyStoreDataversion": "hnm9sjt-28",
        "lang": "en-GB",
        "limit": str(limit),
        "rowlength": "4",
        "store": "ROE"
    }

    count = itertools.count(0, step=limit)

    for offset in map(str, count):
        params["offset"] = offset
        response = requests.get(api_url, params=params)
        response.raise_for_status()

        products = response.json()["products"]
        print("Getting next {} products...".format(len(products)))
        yield products
        # A page that is not full means there is nothing left to fetch.
        if len(products) != limit:
            break

def main():

    from itertools import chain

    print("Retrieving IE products.\n")

    products = list(chain.from_iterable(get_products()))

    print("Discovered {} product(s) in total.\n".format(len(products)))

    for product in products:
        print("\"{}\" - ({})".format(product["name"], product["price"]["current"]["text"]))

    return 0


if __name__ == "__main__":
    import sys
    sys.exit(main())

Output:

Retrieving IE products.

Getting next 200 products...
Getting next 200 products...
Getting next 200 products...
Getting next 200 products...
...
Getting next 200 products...
Getting next 200 products...
Getting next 200 products...
Getting next 152 products...
Discovered 8752 product(s) in total.

"ASOS DESIGN organic relaxed long sleeve t-shirt with colour block sleeves" - (€22.99)
"ASOS DESIGN long sleeve t-shirt with cut and sew panels in grey" - (€22.99)
"ASOS DESIGN 2 pack long sleeve sleeve waffle t-shirt" - (€31.99)
"ASOS DESIGN knitted vest with floral design in khaki" - (€27.99)
"ASOS DESIGN oversized t-shirt in tie-dye organic cotton with smile chest print" - (€25.99)
"ASOS Daysocial oversized t-shirt with placement tie dye in blue" - (€24.99)
"ASOS Daysocial co-ord oversized t-shirt with front and back flower logo prints in green" - (€22.99)
"ASOS Daysocial oversized t-shirt with multi placement sun and logo prints in teal" - (€24.99)
"COLLUSION oversized long sleeve t-shirt with print in acid wash pique fabric" - (€20.99)
...
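
The paginate-then-flatten pattern used by get_products can be tried without touching the network. In this sketch `paginate` and `fake_fetch` are hypothetical names, and the in-memory list of 452 fake products stands in for the real API; only `itertools.count` and `itertools.chain.from_iterable` come from the code above.

```python
from itertools import chain, count


def paginate(fetch_page, limit):
    """Yield pages from fetch_page(offset, limit) until a page is not full."""
    for offset in count(0, limit):  # 0, limit, 2 * limit, ...
        page = fetch_page(offset, limit)
        yield page
        if len(page) != limit:  # a page that is not full is the last one
            break


# Stand-in for the HTTP request: 452 fake products served from memory.
FAKE_PRODUCTS = ["product-{}".format(i) for i in range(452)]


def fake_fetch(offset, limit):
    return FAKE_PRODUCTS[offset:offset + limit]


pages = list(paginate(fake_fetch, 200))
products = list(chain.from_iterable(pages))  # flatten the list of pages

print([len(page) for page in pages])  # prints [200, 200, 52]
print(len(products))                  # prints 452
```

If the total happens to be an exact multiple of the limit, the last page is empty; it is still yielded, but flattening simply contributes nothing for it.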