Selenium and BeautifulSoup scraper is very inconsistent

Asked: 2020-10-23 23:58:27

Tags: python api selenium web-scraping beautifulsoup

I made a scraper for the StockX website. The program scrapes the sales data from a model's pop-up table, parses out the sale prices, adds them all together, and works out the average sale price. The problem is that every time I run the program it gives me a different average.

from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys 
from itertools import islice

PATH = "C:\Program Files (x86)\chromedriver.exe"

driver = webdriver.Chrome(PATH)
driver.get("https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor")
time.sleep(3)
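# click through to the pop-up table of sales (absolute XPaths like these are brittle)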
driver.find_element_by_xpath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[2]/div/div[2]/div/div/button').click()
driver.find_element_by_xpath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[2]/div/div[1]/div[2]/a').click()
time.sleep(3)

while EC.element_to_be_clickable((By.LINK_TEXT,'Load More')):
    try:
        driver.find_element_by_css_selector('body > div:nth-child(55) > div > div > div > div.modal-body > div > button').click()
    except Exception as e:
        print(e)
        break
    
src = driver.page_source
soup = BeautifulSoup(src, features='html.parser')
table = soup.table
table_rows = table.find_all('tr')
raw_price_data = list()
bs = list()
for row in islice(table_rows, 1, None):
    td = row.find_all('td')[1]
    raw_price_data.append(td.text[1:])
raw_price_data = list(map(int, raw_price_data))
Sum = sum(raw_price_data)
avg = Sum / len(raw_price_data)
total_sales = len(raw_price_data)

avg = round(avg, 2)
print(f'Total Sales:{total_sales}')
print(f'Average Profit:{avg}')

I recently added the total-sales count to tell me how many rows the BeautifulSoup scrape picks up, and it comes out different from what I expect every time.

Total Sales:1390 Average Profit:402.29

Total Sales:990 Average Profit:400.05

Total Sales:2270 Average Profit:407.36

These are all different results from running the same program. I'm new to Python and web scraping, and would really appreciate any help.
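A note on the Load More loop above: EC.element_to_be_clickable(...) only builds a condition object, which is always truthy, so the while test never actually waits for anything. The loop just clicks as fast as it can and breaks at whatever random moment the button hasn't rendered yet, which would explain the varying row counts. A minimal sketch of an explicit-wait version (untested against the live site), reusing the 'Load More' link text from the code above:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

# keep clicking "Load More", waiting up to 10 seconds for the button
# to become clickable each time; stop once it no longer shows up
while True:
    try:
        button = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.LINK_TEXT, 'Load More'))
        )
        button.click()
    except TimeoutException:
        break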

This is the code I was able to write using the undocumented API. I found the API URL using the Network tab in dev tools; PAYLOAD holds the query parameters for that same request. I still can't get the data I want out of Python though, just error codes.

import requests
from  urllib.parse import urlencode


headers = {
    'authority': 'stockx.com',
    'appos': 'web',
    'x-requested-with': 'XMLHttpRequest',
    'authorization': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'appversion': '0.1',
    'accept': '*/*',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor',
    'accept-language': 'en-US,en;q=0.9',
    'if-none-match': 'W/^\\^c3f-9c/EjYGTDiuj1w0OlHbjycsHHYU^\\^',
}

PAYLOAD = {
    "state": "480",
    "currency": "USD",
    "limit": 203,
    "page": "1",
    "sort": "createdAt",
    "order": "DESC",
    "country": "US"
}

api = 'https://stockx.com/api/products/509c6166-53d4-49bf-9221-fc10cb298911/chart'
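# NB: there is no '?' between the endpoint and the query string, so the line
# below requests '.../chartstate=480&currency=USD&...', which is likely one
# source of the error codes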
response = requests.get(f'{api}{urlencode(PAYLOAD)}', headers=headers)
print(response)

2 Answers:

Answer 0 (score: 1):

How about ditching selenium and BeautifulSoup in favor of pure requests?

How? Well, you scrape the page's API instead. All you need is the product's sku number. How do you get it?

You comb through the product's page source, only to find a bunch of <script> tags holding what looks like JSON data. Nice, right?

What's more, you realize the sku sits in one of those tags, always between the model and color values. Why not regex it out?
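For instance, against a made-up fragment of that embedded JSON (for illustration only; the real page source is far longer), the pattern used below pulls out whatever sits between the sku and color keys:

import re

# made-up fragment of the embedded JSON, for illustration only
fragment = '"model":"Supreme Cardigan","sku":"509c6166-53d4-49bf-9221-fc10cb298911","color":"Black"'
print(re.findall(r'"sku":"(.+)","color', fragment))
# prints: ['509c6166-53d4-49bf-9221-fc10cb298911']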

Then you drop the sku into the API URL, parse the response, and work out the average.

Putting it all together:

import re
from datetime import datetime
from urllib.parse import urlencode

import requests


PRODUCT_URL = "https://stockx.com/supreme-brushed-mohair-cardigan-black"
PRODUCT_NAME = " ".join(i.title() for i in PRODUCT_URL.split('/')[-1].split('-'))

PAYLOAD = {
    "start_date": "all",
    "end_date": "2020-10-24",
    "intervals": 100,
    "format": "highstock",
    "currency": "USD"
}
HEADERS = {
    "referer": PRODUCT_URL,
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}


product_page = requests.get(PRODUCT_URL, headers=HEADERS).text
product_sku = "".join(re.findall(r'"sku":"(.+)","color', product_page))

api_url = f"https://stockx.com/api/products/{product_sku}/chart?"
response = requests.get(f"{api_url}{urlencode(PAYLOAD)}", headers=HEADERS).json()

print(f"{response['title']['text']} for {PRODUCT_NAME}:")
series = response["series"][0]["data"]
for item in series:
    timestamp, price = item
    human_readable_date = datetime.fromtimestamp(int(timestamp / 1000))
    print(f"{human_readable_date} - {price}")

print("-" * 25)
print(f"Average: {sum(i[1] for i in series) / len(series)} {PAYLOAD['currency']}")

This outputs:

Average price over time for Supreme Brushed Mohair Cardigan Black:
2020-10-22 17:18:19 - 250
2020-10-22 17:49:13 - 250
2020-10-22 18:20:07 - 250
2020-10-22 18:51:01 - 250
2020-10-22 19:21:56 - 250
2020-10-22 19:52:50 - 250
2020-10-22 20:23:44 - 250
2020-10-22 20:54:38 - 250
2020-10-22 21:25:33 - 250
2020-10-22 21:56:27 - 250
2020-10-22 22:27:21 - 250
2020-10-22 22:58:15 - 250
2020-10-22 23:29:10 - 250
2020-10-23 00:00:04 - 250
2020-10-23 00:30:58 - 250
2020-10-23 01:01:53 - 200
2020-10-23 01:32:47 - 219
2020-10-23 02:03:41 - 219
2020-10-23 02:34:35 - 219
2020-10-23 03:05:30 - 219
2020-10-23 03:36:24 - 219
2020-10-23 04:07:18 - 219
2020-10-23 04:38:12 - 219
2020-10-23 05:09:07 - 219
2020-10-23 05:40:01 - 219
2020-10-23 06:10:55 - 219
2020-10-23 06:41:50 - 219
2020-10-23 07:12:44 - 219
2020-10-23 07:43:38 - 219
2020-10-23 08:14:32 - 219
2020-10-23 08:45:27 - 219
2020-10-23 09:16:21 - 219
2020-10-23 09:47:15 - 219
2020-10-23 10:18:09 - 219
2020-10-23 10:49:04 - 219
2020-10-23 11:19:58 - 219
2020-10-23 11:50:52 - 219
2020-10-23 12:21:46 - 219
2020-10-23 12:52:41 - 219
2020-10-23 13:23:35 - 219
2020-10-23 13:54:29 - 219
2020-10-23 14:25:24 - 219
2020-10-23 14:56:18 - 219
2020-10-23 15:27:12 - 257
2020-10-23 15:58:06 - 257
2020-10-23 16:29:01 - 315
2020-10-23 16:59:55 - 315
2020-10-23 17:30:49 - 315
2020-10-23 18:01:43 - 315
2020-10-23 18:32:38 - 315
2020-10-23 19:03:32 - 315
2020-10-23 19:34:26 - 315
2020-10-23 20:05:21 - 315
2020-10-23 20:36:15 - 315
2020-10-23 21:07:09 - 315
2020-10-23 21:38:03 - 315
2020-10-23 22:08:58 - 315
2020-10-23 22:39:52 - 315
2020-10-23 23:10:46 - 315
2020-10-23 23:41:40 - 315
2020-10-24 00:12:35 - 315
2020-10-24 00:43:29 - 315
2020-10-24 01:14:23 - 315
2020-10-24 01:45:18 - 315
2020-10-24 02:16:12 - 315
2020-10-24 02:47:06 - 315
2020-10-24 03:18:00 - 315
2020-10-24 03:48:55 - 315
2020-10-24 04:19:49 - 315
2020-10-24 04:50:43 - 315
2020-10-24 05:21:37 - 315
2020-10-24 05:52:32 - 315
2020-10-24 06:23:26 - 315
2020-10-24 06:54:20 - 315
2020-10-24 07:25:14 - 315
2020-10-24 07:56:09 - 277
2020-10-24 08:27:03 - 277
2020-10-24 08:57:57 - 277
2020-10-24 09:28:52 - 277
2020-10-24 09:59:46 - 277
2020-10-24 10:30:40 - 277
2020-10-24 11:01:34 - 277
2020-10-24 11:32:29 - 277
2020-10-24 12:03:23 - 277
2020-10-24 12:34:17 - 277
2020-10-24 13:05:11 - 277
2020-10-24 13:36:06 - 277
2020-10-24 14:07:00 - 277
2020-10-24 14:37:54 - 277
2020-10-24 15:08:49 - 277
2020-10-24 15:39:43 - 277
2020-10-24 16:10:37 - 277
2020-10-24 16:41:31 - 277
2020-10-24 17:12:26 - 277
2020-10-24 17:43:20 - 277
2020-10-24 18:14:14 - 277
2020-10-24 18:45:08 - 277
2020-10-24 19:16:03 - 277
2020-10-24 19:46:57 - 277
2020-10-24 20:17:51 - 277
-------------------------
Average: 267.52 USD

Bonus:

This works for any product URL! XD

Edit:

To get a response from the activity endpoint, try this:

import re

import requests
from urllib.parse import urlencode


PRODUCT_URL = "https://stockx.com/supreme-brushed-mohair-cardigan-black"
PRODUCT_NAME = " ".join(i.title() for i in PRODUCT_URL.split('/')[-1].split('-'))


HEADERS = {
    "referer": PRODUCT_URL,
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    "x-requested-with": 'XMLHttpRequest',
}

PAYLOAD = {
    "state": "480",
    "currency": "USD",
    "limit": 10,
    "sort": "createdAt",
    "order": "DESC",
    "country": "US"
}


product_page = requests.get(PRODUCT_URL, headers=HEADERS).text
product_sku = "".join(re.findall(r'"sku":"(.+)","color', product_page))

api_url = f"https://stockx.com/api/products/{product_sku}/activity?"
response = requests.get(f'{api_url}{urlencode(PAYLOAD)}', headers=HEADERS).json()

for item in response:
    print(f"{item['chainId']} - shoe size: {item['shoeSize']} at {item['amount']} {item['localCurrency']}")

Output:

13451189134512711854 - shoe size: M at 250.6639 USD
13454493613262000326 - shoe size: XL at 305 USD
13454719168825677535 - shoe size: M at 250.451 USD
13454432070832901351 - shoe size: XL at 321.4601 USD
13454370874521531956 - shoe size: XL at 315 USD
13454370577625857582 - shoe size: XL at 320 USD
13450796013705700403 - shoe size: M at 240 USD
...

Answer 1 (score: 0):

So I finally figured it out. After re-reading @baduker's post (thanks again!), I decided to throw out BeautifulSoup and Selenium and just use requests and urllib. Although baduker's approach put me on the right path, the API he used didn't give me the data I was after, the average profit. After a bit of research on undocumented APIs, I was able to dig through dev tools and find the request I actually needed: the /activity endpoint rather than the /chart one.

Once I had the right API, I had to work out how to use the parameters, so I experimented with the URL until I got all 203 sales back. From there it was easy to move it over to Python; I just had to make sure I picked the right parameters and the required headers. The one thing I'm still unsure about is what the user-agent means, and why the script fails when it's left out while every other header can be dropped.

Here is the code I use to collect the average total profit on the cardigan item I was after.

import requests
from urllib.parse import urlencode


API_URL = 'https://stockx.com/api/products/509c6166-53d4-49bf-9221-fc10cb298911/activity?'


PAYLOAD = {
    "state": "480",
    "currency": "USD",
    "limit": 203,
    "page": "1",
    "sort": "createdAt",
    "order": "DESC",
    "country": "US"
}

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor',
}

response = requests.get(f'{API_URL}{urlencode(PAYLOAD)}', headers=HEADERS).json()


ProductActivity = response['ProductActivity']
amount = [i['amount'] for i in ProductActivity]  # sale prices
tprofit = list()
for price in amount:
    profit = price - 188  # 188 = purchase cost, hard-coded
    tprofit.append(profit)
    print(f'Sale Price:Profit--------------{price}:{profit}')
print('-'*40)
ave = sum(tprofit)/len(amount)
print(f'Raw Sale Average--------------{ave}')

I plan to build on this so it can be used for any product. If anyone out there knows more about user agents, or about how I can make sure it loads all the data even when there are new sales, please let me know, because right now the number of results is hard-coded into the program and will be stale the next time I run the script.
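One way around the hard-coded limit is to page through the endpoint until it runs dry. A sketch under two assumptions: that the endpoint honours the page parameter from PAYLOAD above, and that it returns an empty ProductActivity list once the sales run out (fetch_all_sales is a hypothetical helper, untested against the live API):

import requests
from urllib.parse import urlencode


def fetch_all_sales(api_url, payload, headers):
    # walk the page parameter upward until a page comes back empty,
    # rather than guessing the total number of sales in advance
    sales = []
    page = 1
    while True:
        params = dict(payload, page=str(page))
        response = requests.get(f'{api_url}{urlencode(params)}', headers=headers).json()
        batch = response.get('ProductActivity', [])
        if not batch:
            break
        sales.extend(batch)
        page += 1
    return sales


all_sales = fetch_all_sales(API_URL, PAYLOAD, HEADERS)
print(f'Fetched {len(all_sales)} sales in total')

That way the total never has to be guessed in advance; the limit value from PAYLOAD just controls the batch size per request.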