Scraping multiple href links with Python Selenium

Date: 2019-03-10 04:05:01

Tags: python selenium-webdriver web-scraping

Here is the URL to test: https://stockx.com/puma?prices=300-400,200-300&size_types=men&years=2017

I am able to extract the links to all the product detail pages, but in the end I only get one result. It should visit every link and extract the name and image URL for each. What am I missing here?
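A common cause of "only one result" in a Selenium loop is holding on to live element objects after navigating away from the listing page (they go stale). The usual fix is to copy the plain href strings into a list first, then visit each URL. A minimal sketch of that pattern, using BeautifulSoup on an inline HTML snippet (the markup and selector here are illustrative, not StockX's real page structure):

```python
from bs4 import BeautifulSoup

# Illustrative HTML standing in for the listing page (not StockX's real markup).
html = """
<div id="products-container">
  <a href="/puma-clyde-wwe-undertaker-black">Puma Clyde WWE Undertaker Black</a>
  <a href="/puma-suede-classic">Puma Suede Classic</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Extract the href strings up front; once they are plain strings it is safe
# to navigate to each URL in turn without any stale element references.
links = [a["href"] for a in soup.select("#products-container [href]")]
print(links)  # ['/puma-clyde-wwe-undertaker-black', '/puma-suede-classic']
```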

The current output as JSON:

[
    {
        "product_name": "Puma Clyde WWE Undertaker Black",
        "imgurl": "https://stockx.imgix.net/Puma-Clyde-WWE-Undertaker-Black.png?fit=fill&bg=FFFFFF&w=700&h=500&auto=format,compress&q=90&dpr=2&trim=color&updated_at=1538080256"
    }
]

1 Answer:

Answer 0: (score: 2)

You can do the whole thing with requests. I selected some random items from the pages visited to prove they were visited.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

baseURL = 'https://stockx.com'
final = []

with requests.Session() as s:
    # Fetch the listing page and collect every link inside the products container
    res = s.get('https://stockx.com/puma?prices=300-400,200-300&size_types=men&years=2017')
    soup = bs(res.content, 'lxml')
    items = soup.select('#products-container [href]')
    titles = [item['id'] for item in items]
    links = [baseURL + item['href'] for item in items]
    results = list(zip(titles, links))
    df = pd.DataFrame(results)  # listing-level frame (not written out)
    # Visit each product page and pull the text of its .detail elements
    for result in results:
        res = s.get(result[1])
        soup = bs(res.content, 'lxml')
        details = [item.text for item in soup.select('.detail')]
        final.append([result[0], result[1], details])

df2 = pd.DataFrame(final)
df2.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8', index=False)
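One small improvement worth making before writing the CSV: give the DataFrame explicit column names so the output file is self-describing. A sketch with dummy rows shaped like the scraper's `final` list (the column names are my suggestion, not part of the original answer):

```python
import pandas as pd

# Dummy rows in the same shape the scraper builds: [title, link, details].
final = [
    ["Puma Clyde WWE Undertaker Black", "https://stockx.com/puma-clyde", ["detail-a", "detail-b"]],
]

# Naming the columns means the CSV header row is meaningful instead of 0,1,2.
df2 = pd.DataFrame(final, columns=["product_name", "url", "details"])
print(df2.columns.tolist())  # ['product_name', 'url', 'details']
```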