Here is the URL to test: https://stockx.com/puma?prices=300-400,200-300&size_types=men&years=2017
I am able to extract the links to all of the product detail pages, but in the end I only get one result. It should go to every link and extract the name and img URL for me. What am I missing here?
Here is the JSON output from my current working code:
[
  {
    "product_name": "Puma Clyde WWE Undertaker Black",
    "imgurl": "https://stockx.imgix.net/Puma-Clyde-WWE-Undertaker-Black.png?fit=fill&bg=FFFFFF&w=700&h=500&auto=format,compress&q=90&dpr=2&trim=color&updated_at=1538080256"
  }
]
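(The extraction code itself is not shown in the question. Purely as a hypothetical sketch of the most common cause of a single-record result, the snippet below rebuilds the output list inside the loop instead of appending to it; scrape_one and links are made-up names used only for illustration.)

# Hypothetical illustration only -- not the code from the question.
def scrape_one(link):
    # Stand-in for "fetch the page and return its name/img record".
    return {"product_name": link, "imgurl": link + ".png"}

links = ['https://stockx.com/item-a', 'https://stockx.com/item-b']

rows = []
for link in links:
    rows = [scrape_one(link)]        # bug: the list is rebuilt each pass, so only the last record survives
    # rows.append(scrape_one(link))  # fix: append, so every visited page keeps its record

print(rows)  # prints a single record, matching the symptom described above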
Answer (score: 2)
You can do the whole thing with requests, as shown below. I picked out a few random items from the pages visited to prove that they were actually visited.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

baseURL = 'https://stockx.com'
final = []

with requests.Session() as s:
    # Fetch the listing page and collect every link inside the products container
    res = s.get('https://stockx.com/puma?prices=300-400,200-300&size_types=men&years=2017')
    soup = bs(res.content, 'lxml')
    items = soup.select('#products-container [href]')
    titles = [item['id'] for item in items]
    links = [baseURL + item['href'] for item in items]
    results = list(zip(titles, links))
    df = pd.DataFrame(results)

    # Visit each product detail page and pull the text of its .detail elements
    for result in results:
        res = s.get(result[1])
        soup = bs(res.content, 'lxml')
        details = [item.text for item in soup.select('.detail')]
        final.append([result[0], result[1], details])

df2 = pd.DataFrame(final)
df2.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8', index=False)
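Note that the '.detail' selector collects the spec text from each product page, but not the product_name and imgurl fields shown in the question's JSON. Below is a minimal sketch of one way to add them, assuming (not verified against StockX's markup) that each detail page exposes the name and image through standard Open Graph meta tags (og:title / og:image); it reuses s, results, final and bs from the code above.

# Assumption: og:title / og:image meta tags exist on the detail pages (unverified).
def og_content(soup, prop):
    tag = soup.select_one(f'meta[property="{prop}"]')
    return tag['content'] if tag else None

# Drop-in replacement for the detail-page loop above
for title, link in results:
    res = s.get(link)
    soup = bs(res.content, 'lxml')
    product_name = og_content(soup, 'og:title')  # e.g. "Puma Clyde WWE Undertaker Black"
    img_url = og_content(soup, 'og:image')       # image URL, as in the question's JSON
    details = [item.text for item in soup.select('.detail')]
    final.append([title, link, product_name, img_url, details])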