I made a scraper for the StockX website. The program scrapes the sales data from a model's pop-up table, parses out the sale prices, adds all the prices together, and works out the average sale price. The problem is that every time I run the program it gives me a different average.
from bs4 import BeautifulSoup
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from itertools import islice

PATH = "C:\Program Files (x86)\chromedriver.exe"
driver = webdriver.Chrome(PATH)

driver.get("https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor")
time.sleep(3)

# open the sales history pop-up
driver.find_element_by_xpath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[2]/div/div[2]/div/div/button').click()
driver.find_element_by_xpath('//*[@id="root"]/div[1]/div[2]/div[2]/div[9]/div/div/div/div[2]/div/div[1]/div[2]/a').click()
time.sleep(3)

# keep clicking "Load More" until the button can no longer be found
while EC.element_to_be_clickable((By.LINK_TEXT, 'Load More')):
    try:
        driver.find_element_by_css_selector('body > div:nth-child(55) > div > div > div > div.modal-body > div > button').click()
    except Exception as e:
        print(e)
        break

src = driver.page_source
soup = BeautifulSoup(src, features='html.parser')

table = soup.table
table_rows = table.find_all('tr')

raw_price_data = list()
bs = list()

# skip the header row, grab the price column, and drop the leading "$"
for row in islice(table_rows, 1, None):
    td = row.find_all('td')[1]
    raw_price_data.append(td.text[1:])

raw_price_data = list(map(int, raw_price_data))
Sum = sum(raw_price_data)
avg = Sum / len(raw_price_data)
total_sales = len(raw_price_data)
avg = round(avg, 2)

print(f'Total Sales:{total_sales}')
print(f'Average Profit:{avg}')
I recently added the total sales count so it would tell me how many rows the BeautifulSoup scrape picked up, and it comes out different every time, which is not what I expected.
Total Sales:1390 Average Profit:402.29
Total Sales:990 Average Profit:400.05
Total Sales:2270 Average Profit:407.36
These are all different results from running the same program. I'm new to Python and would really appreciate any help with web scraping.
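One likely cause of the varying row counts: EC.element_to_be_clickable(...) on its own only builds a condition object, which is always truthy, so the while test never actually waits for anything; the loop clicks as fast as it can, and the page source gets grabbed before every batch of rows has rendered. Below is a minimal sketch of a load-more loop that waits before each click; it reuses the driver and the button selector from the snippet above, neither of which is verified against the live page.

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# selector copied from the snippet above (illustrative, not verified)
LOAD_MORE = (By.CSS_SELECTOR, 'body > div:nth-child(55) > div > div > div > div.modal-body > div > button')

while True:
    try:
        # wait up to 10 seconds for the button to be clickable, then click it
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable(LOAD_MORE)).click()
    except TimeoutException:
        # the button never became clickable again -> assume all rows are loaded
        break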
This is the code I've been able to write using the undocumented API. I found the API using the Network tab in the dev tools. The PAYLOAD holds the query parameters for the same request I found there. I still can't get the data I want into Python, though, just error codes.
import requests
from urllib.parse import urlencode

headers = {
    'authority': 'stockx.com',
    'appos': 'web',
    'x-requested-with': 'XMLHttpRequest',
    'authorization': '',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'appversion': '0.1',
    'accept': '*/*',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-mode': 'cors',
    'sec-fetch-dest': 'empty',
    'referer': 'https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor',
    'accept-language': 'en-US,en;q=0.9',
    'if-none-match': 'W/^\\^c3f-9c/EjYGTDiuj1w0OlHbjycsHHYU^\\^',
}

PAYLOAD = {
    "state": "480",
    "currency": "USD",
    "limit": 203,
    "page": "1",
    "sort": "createdAt",
    "order": "DESC",
    "country": "US"
}

api = 'https://stockx.com/api/products/509c6166-53d4-49bf-9221-fc10cb298911/chart'

response = requests.get(f'{api}{urlencode(PAYLOAD)}', headers=headers)
print(response)
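One thing that stands out in the snippet above: the API URL doesn't end with a '?', so urlencode(PAYLOAD) gets glued straight onto 'chart', which could explain the error responses. Here is a minimal sketch that lets requests build the query string itself, using the same endpoint and parameters as above; whether the server accepts this exact combination of headers and parameters is not verified.

import requests

api = 'https://stockx.com/api/products/509c6166-53d4-49bf-9221-fc10cb298911/chart'

params = {
    "state": "480",
    "currency": "USD",
    "limit": 203,
    "page": "1",
    "sort": "createdAt",
    "order": "DESC",
    "country": "US",
}

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor',
}

# requests appends '?state=480&currency=USD&...' to the URL for us
response = requests.get(api, params=params, headers=headers)
print(response.status_code)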
Answer 0 (score: 1)
How about ditching selenium and BeautifulSoup in favor of pure requests?

How? Well, you're scraping the page's API. All you need is the product's sku number. How do you get that?

You comb through the product's page source, only to find a bunch of <script> tags that look a lot like JSON data. Nice, right?

What's more, you realize the sku sits in one of those tags, and it's always between the model and color values. Why not regex it out?

Then you drop the sku into the API URL, parse the response, and calculate the average.

Putting it all together:
import re
from datetime import datetime
from urllib.parse import urlencode

import requests

PRODUCT_URL = "https://stockx.com/supreme-brushed-mohair-cardigan-black"
PRODUCT_NAME = " ".join(i.title() for i in PRODUCT_URL.split('/')[-1].split('-'))

PAYLOAD = {
    "start_date": "all",
    "end_date": "2020-10-24",
    "intervals": 100,
    "format": "highstock",
    "currency": "USD"
}

HEADERS = {
    "referer": PRODUCT_URL,
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    "x-requested-with": "XMLHttpRequest",
}

product_page = requests.get(PRODUCT_URL, headers=HEADERS).text
product_sku = "".join(re.findall(r'"sku":"(.+)","color', product_page))

api_url = f"https://stockx.com/api/products/{product_sku}/chart?"
response = requests.get(f"{api_url}{urlencode(PAYLOAD)}", headers=HEADERS).json()

print(f"{response['title']['text']} for {PRODUCT_NAME}:")

series = response["series"][0]["data"]
for item in series:
    timestamp, price = item
    human_readable_date = datetime.fromtimestamp(int(timestamp / 1000))
    print(f"{human_readable_date} - {price}")

print("-" * 25)
print(f"Average: {sum(i[1] for i in series) / len(series)} {PAYLOAD['currency']}")
This outputs:
Average price over time for Supreme Brushed Mohair Cardigan Black:
2020-10-22 17:18:19 - 250
2020-10-22 17:49:13 - 250
2020-10-22 18:20:07 - 250
2020-10-22 18:51:01 - 250
2020-10-22 19:21:56 - 250
2020-10-22 19:52:50 - 250
2020-10-22 20:23:44 - 250
2020-10-22 20:54:38 - 250
2020-10-22 21:25:33 - 250
2020-10-22 21:56:27 - 250
2020-10-22 22:27:21 - 250
2020-10-22 22:58:15 - 250
2020-10-22 23:29:10 - 250
2020-10-23 00:00:04 - 250
2020-10-23 00:30:58 - 250
2020-10-23 01:01:53 - 200
2020-10-23 01:32:47 - 219
2020-10-23 02:03:41 - 219
2020-10-23 02:34:35 - 219
2020-10-23 03:05:30 - 219
2020-10-23 03:36:24 - 219
2020-10-23 04:07:18 - 219
2020-10-23 04:38:12 - 219
2020-10-23 05:09:07 - 219
2020-10-23 05:40:01 - 219
2020-10-23 06:10:55 - 219
2020-10-23 06:41:50 - 219
2020-10-23 07:12:44 - 219
2020-10-23 07:43:38 - 219
2020-10-23 08:14:32 - 219
2020-10-23 08:45:27 - 219
2020-10-23 09:16:21 - 219
2020-10-23 09:47:15 - 219
2020-10-23 10:18:09 - 219
2020-10-23 10:49:04 - 219
2020-10-23 11:19:58 - 219
2020-10-23 11:50:52 - 219
2020-10-23 12:21:46 - 219
2020-10-23 12:52:41 - 219
2020-10-23 13:23:35 - 219
2020-10-23 13:54:29 - 219
2020-10-23 14:25:24 - 219
2020-10-23 14:56:18 - 219
2020-10-23 15:27:12 - 257
2020-10-23 15:58:06 - 257
2020-10-23 16:29:01 - 315
2020-10-23 16:59:55 - 315
2020-10-23 17:30:49 - 315
2020-10-23 18:01:43 - 315
2020-10-23 18:32:38 - 315
2020-10-23 19:03:32 - 315
2020-10-23 19:34:26 - 315
2020-10-23 20:05:21 - 315
2020-10-23 20:36:15 - 315
2020-10-23 21:07:09 - 315
2020-10-23 21:38:03 - 315
2020-10-23 22:08:58 - 315
2020-10-23 22:39:52 - 315
2020-10-23 23:10:46 - 315
2020-10-23 23:41:40 - 315
2020-10-24 00:12:35 - 315
2020-10-24 00:43:29 - 315
2020-10-24 01:14:23 - 315
2020-10-24 01:45:18 - 315
2020-10-24 02:16:12 - 315
2020-10-24 02:47:06 - 315
2020-10-24 03:18:00 - 315
2020-10-24 03:48:55 - 315
2020-10-24 04:19:49 - 315
2020-10-24 04:50:43 - 315
2020-10-24 05:21:37 - 315
2020-10-24 05:52:32 - 315
2020-10-24 06:23:26 - 315
2020-10-24 06:54:20 - 315
2020-10-24 07:25:14 - 315
2020-10-24 07:56:09 - 277
2020-10-24 08:27:03 - 277
2020-10-24 08:57:57 - 277
2020-10-24 09:28:52 - 277
2020-10-24 09:59:46 - 277
2020-10-24 10:30:40 - 277
2020-10-24 11:01:34 - 277
2020-10-24 11:32:29 - 277
2020-10-24 12:03:23 - 277
2020-10-24 12:34:17 - 277
2020-10-24 13:05:11 - 277
2020-10-24 13:36:06 - 277
2020-10-24 14:07:00 - 277
2020-10-24 14:37:54 - 277
2020-10-24 15:08:49 - 277
2020-10-24 15:39:43 - 277
2020-10-24 16:10:37 - 277
2020-10-24 16:41:31 - 277
2020-10-24 17:12:26 - 277
2020-10-24 17:43:20 - 277
2020-10-24 18:14:14 - 277
2020-10-24 18:45:08 - 277
2020-10-24 19:16:03 - 277
2020-10-24 19:46:57 - 277
2020-10-24 20:17:51 - 277
-------------------------
Average: 267.52 USD
Bonus:

This works for any product URL! XD

EDIT:

To get the response from the activity endpoint, try this:
import re
import requests
from urllib.parse import urlencode

PRODUCT_URL = "https://stockx.com/supreme-brushed-mohair-cardigan-black"
PRODUCT_NAME = " ".join(i.title() for i in PRODUCT_URL.split('/')[-1].split('-'))

HEADERS = {
    "referer": PRODUCT_URL,
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36",
    "x-requested-with": 'XMLHttpRequest',
}

PAYLOAD = {
    "state": "480",
    "currency": "USD",
    "limit": 10,
    "sort": "createdAt",
    "order": "DESC",
    "country": "US"
}

product_page = requests.get(PRODUCT_URL, headers=HEADERS).text
product_sku = "".join(re.findall(r'"sku":"(.+)","color', product_page))

api_url = f"https://stockx.com/api/products/{product_sku}/activity?"
response = requests.get(f'{api_url}{urlencode(PAYLOAD)}', headers=HEADERS).json()

for item in response:
    print(f"{item['chainId']} - shoe size: {item['shoeSize']} at {item['amount']} {item['localCurrency']}")
Output:
13451189134512711854 - shoe size: M at 250.6639 USD
13454493613262000326 - shoe size: XL at 305 USD
13454719168825677535 - shoe size: M at 250.451 USD
13454432070832901351 - shoe size: XL at 321.4601 USD
13454370874521531956 - shoe size: XL at 315 USD
13454370577625857582 - shoe size: XL at 320 USD
13450796013705700403 - shoe size: M at 240 USD
...
Answer 1 (score: 0)
So I finally figured it out. After re-reading @baduker's post (thanks!), I decided to throw out Beautiful Soup and Selenium and just use requests and urllib. Although baduker's approach put me on the right path, the API he used didn't give me the data I was looking for, the average profit. After doing some research on undocumented APIs, I was able to dig through the dev tools and find the request I needed: the /activity API rather than /chart. Once I had found the right API, I had to figure out how to use the parameters, so I experimented with the URL until I got 203 sales. From there it was easy to move it over to Python, as long as I chose the right parameters and the headers I needed. The one thing I'm still not sure about is what the user-agent means, and why the script fails if it isn't included when everything else can be left out.
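As a side note on the user-agent question: the header only tells the server which client is making the request, and many sites reject requests whose user-agent doesn't look like a browser (requests sends its own "python-requests/x.y.z" by default). A quick, unverified way to see what difference it makes here is to fire the same request with and without the header and compare the status codes:

import requests

# same endpoint as below; parameters kept minimal for the comparison
url = 'https://stockx.com/api/products/509c6166-53d4-49bf-9221-fc10cb298911/activity?state=480&currency=USD&limit=10'

with_ua = requests.get(url, headers={'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'})
without_ua = requests.get(url)

print('with user-agent:   ', with_ua.status_code)
print('without user-agent:', without_ua.status_code)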
Here is my code for gathering the average total profit on the cardigan item I was looking at.
import requests
from urllib.parse import urlencode

API_URL = 'https://stockx.com/api/products/509c6166-53d4-49bf-9221-fc10cb298911/activity?'

PAYLOAD = {
    "state": "480",
    "currency": "USD",
    "limit": 203,
    "page": "1",
    "sort": "createdAt",
    "order": "DESC",
    "country": "US"
}

HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor',
}

response = requests.get(f'{API_URL}' + urlencode(PAYLOAD), headers=HEADERS).json()

ProductActivity = response['ProductActivity']
amount = [i['amount'] for i in ProductActivity]

tprofit = list()
for price in amount:
    # 188 is the hard-coded item cost, so profit = sale price - cost
    profit = price - 188
    tprofit.append(profit)
    print(f'Sale Price:Profit--------------{price}:{profit}')

print('-'*40)
ave = sum(tprofit)/len(amount)
print(f'Raw Sale Average--------------{ave}')
I plan to build on this so it can be used for any product. If anyone out there knows more about the user-agent, or how I can make sure it loads all of the data even when there are new sales, please let me know, because right now the number of results is hard-coded into the program and I'd have to change it the next time I run the script.
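For what it's worth, here is a minimal sketch of one way to drop the hard-coded limit: keep requesting pages until the API returns an empty ProductActivity list. This assumes the /activity endpoint honors the page and limit parameters the same way it did above, which is not verified.

import requests

API_URL = 'https://stockx.com/api/products/509c6166-53d4-49bf-9221-fc10cb298911/activity'
HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'referer': 'https://stockx.com/supreme-patchwork-mohair-cardigan-multicolor',
}

all_sales = []
page = 1
while True:
    payload = {
        "state": "480",
        "currency": "USD",
        "limit": 100,   # page size; assumed to be accepted by the endpoint
        "page": page,
        "sort": "createdAt",
        "order": "DESC",
        "country": "US",
    }
    activity = requests.get(API_URL, params=payload, headers=HEADERS).json().get('ProductActivity', [])
    if not activity:
        break           # no more sales returned -> stop paginating
    all_sales.extend(activity)
    page += 1

print(f'Fetched {len(all_sales)} sales')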