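I am trying to extract product descriptions. The first loop iterates over each product, and a nested loop visits each product's page to fetch its description.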
for page in range(1, 2):
    guitarPage = requests.get('https://www.guitarguitar.co.uk/guitars/acoustic/page-{}'.format(page)).text
    soup = BeautifulSoup(guitarPage, 'lxml')
    guitars = soup.find_all(class_='col-xs-6 col-sm-4 col-md-4 col-lg-3')
This is the loop over each product:
    for guitar in guitars:
        title_text = guitar.h3.text.strip()
        print('Guitar Name: ', title_text)
        price = guitar.find(class_='price bold small').text.strip()
        print('Guitar Price: ', price)
        priceSave = guitar.find('span', {'class': 'price save'})
        if priceSave is not None:
            priceOf = priceSave.text
            print(priceOf)
        else:
            print("No discount!")
        image = guitar.img.get('src')
        print('Guitar Image: ', image)
        productLink = guitar.find('a').get('href')
        linkProd = url + productLink
        print('Link of product', linkProd)
Here I add the collected link to the array:
        productsPage.append(linkProd)
This is my attempt to visit each product page and extract the description:
        for products in productsPage:
            response = requests.get(products)
            soup = BeautifulSoup(response.content, "lxml")
            productsDetails = soup.find("div", {"class": "description-preview"})
            if productsDetails is not None:
                description = productsDetails.text
                # print('product detail: ', description)
            else:
                print('none')
            time.sleep(0.2)
        if None not in (title_text, price, image, linkProd, description):
            products = {
                'title': title_text,
                'price': price,
                'discount': priceOf,
                'image': image,
                'link': linkProd,
                'description': description,
            }
            result.append(products)
            with open('datas.json', 'w') as outfile:
                json.dump(result, outfile, ensure_ascii=False, indent=4, separators=(',', ': '))
            # print(result)
        print('--------------------------')
        time.sleep(0.5)
The result should be:
{
    "title": "Yamaha NTX700 Electro Classical Guitar (Pre-Owned) #HIM041005",
    "price": "£399.00",
    "discount": null,
    "image": "https://images.guitarguitar.co.uk/cdn/large/150/PXP190415342158006-3115645f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
    "link": "https://www.guitarguitar.co.uk/product/pxp190415342158006-3115645--yamaha-ntx700-electro-classical-guitar-pre-owned-him",
    "description": "\nProduct Overview\nThe versatile, contemporary styled NTX line is designed with thinner bodies, narrower necks, 14th fret neck joints, and cutaway designs to provide greater comfort and playability f... read more\n"
},
But the description is that of the first product and does not change for the later ones:
[
    {
        "title": "Yamaha APX600FM Flame Maple Tobacco Sunburst",
        "price": "£239.00",
        "discount": "Save £160.00",
        "image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340677008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
        "link": "https://www.guitarguitar.co.uk/product/190315340677008--yamaha-apx600fm-flame-maple-tobacco-sunburst",
        "description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
    },
    {
        "title": "Yamaha APX600FM Flame Maple Amber",
        "price": "£239.00",
        "discount": "Save £160.00",
        "image": "https://images.guitarguitar.co.uk/cdn/large/150/190315340676008f.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
        "link": "https://www.guitarguitar.co.uk/product/190315340676008--yamaha-apx600fm-flame-maple-amber",
        "description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
    },
    {
        "title": "Yamaha AC1R Acoustic Electric Concert Size Rosewood Back And Sides with SRT Pickup",
        "price": "£399.00",
        "discount": "Save £267.00",
        "image": "https://images.guitarguitar.co.uk/cdn/large/105/11012414211132.jpg?h=190&w=120&mode=crop&bg=ffffff&quality=70&anchor=bottomcenter",
        "link": "https://www.guitarguitar.co.uk/product/11012414211132--yamaha-ac1r-acoustic-electric-concert-size-rosewood-back-and-sid",
        "description": "\nProduct Overview\nOne of the world's best-selling acoustic-electric guitars, the APX600 series introduces an upgraded version with a flame maple top. APX's thinline body combines incredible comfort,... read more\n"
    }
]
This is the result I get. It keeps changing, and sometimes a product shows the description of a previous product.
Answer 0 (score: 0)
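It does loop, but there seems to be some protection in place on the server side, and the pages that fail vary from run to run. I inspected the failing pages and searched them for the content. In my tests, no single approach did the trick (I did not try sleep times above 2, but with sleeps well under 2 I tried some IP and user-agent changes).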
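Separately from anything the server does, the code in the question never resets `description`, so whenever a product page comes back without a `description-preview` div, the record is built with whatever value an earlier product left behind. That would account for the repeated descriptions. Below is a minimal sketch of one way around it: fetch only the current product's page instead of re-requesting every link collected so far, and skip the product when nothing is found. The names `build_records` and `fetch_description` are hypothetical and not part of the original code.

```python
def build_records(items, fetch_description):
    """Attach a description to each scraped item.

    items: list of dicts with at least a 'link' key (title, price, etc.
           may also be present, as collected from the listing page).
    fetch_description: caller-supplied function that takes a link and
           returns the description text, or None if the page had none.
    """
    records = []
    for item in items:
        # The value comes fresh from this product's page on every
        # iteration, so a failed lookup cannot reuse an older description.
        description = fetch_description(item['link'])
        if description is None:
            print('no description for', item['link'])
            continue
        records.append({**item, 'description': description})
    return records
```

In the question's setting, `fetch_description` could wrap the existing `requests.get` call and the `soup.find("div", {"class": "description-preview"})` lookup, returning `None` when the div is missing.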
You could try alternating IPs and user agents, retrying failed requests, and varying the time between requests.
Rotating proxies: https://www.scrapehero.com/how-to-rotate-proxies-and-ip-addresses-using-python-3/
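The suggestions above could be combined roughly as follows. This is only a sketch under assumptions: the user-agent strings and the proxy list are placeholders, `fetch_with_retries` is a hypothetical helper, and the `fetch` callable is supplied by the caller. I have not confirmed that this gets past whatever the site is doing.

```python
import itertools
import random
import time

# Placeholder values for illustration only; substitute real ones.
USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0',
]
PROXIES = [None]  # e.g. {'https': 'http://host:port'}; None means no proxy


def fetch_with_retries(fetch, url, max_attempts=3,
                       min_delay=0.5, max_delay=2.0, sleep=time.sleep):
    """Call fetch(url, headers, proxies) until it returns a non-None value.

    fetch should return None when the page lacked the expected content.
    Returns None if every attempt fails.
    """
    proxy_cycle = itertools.cycle(PROXIES)
    for attempt in range(max_attempts):
        # Pick a different identity for each attempt
        headers = {'User-Agent': random.choice(USER_AGENTS)}
        result = fetch(url, headers, next(proxy_cycle))
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Random pause so retries are not evenly spaced
            sleep(random.uniform(min_delay, max_delay))
    return None
```

With requests, `fetch` could be a small function that calls `requests.get(url, headers=headers, proxies=proxies)`, parses the response, and returns the `description-preview` text or `None`.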