Unable to extract the product title on Amazon

Time: 2019-12-26 07:05:34

Tags: python-3.x web-scraping beautifulsoup python-requests web-crawler

When I run the code below to get the title of a pair of Sony headphones, it finds nothing (the result is an empty list).

import requests    
from bs4 import BeautifulSoup

URL = 'https://www.amazon.com/Sony-Noise-Cancelling-Headphones-WH1000XM3/dp/B07G4MNFS1/ref=sxin_0_ac_d_rm?ac_md=0-0-c29ueQ%3D%3D-ac_d_rm&keywords=sony&pd_rd_i=B07G4MNFS1&pd_rd_r=3e6d5325-8ee4-4ba8-a84f-1b7cf2ce98bf&pd_rd_w=BVSFq&pd_rd_wg=I0LMZ&pf_rd_p=e2f20af2-9651-42af-9a45-89425d5bae34&pf_rd_r=VGT25BXXZNDE3B61A994&psc=1&qid=1577253649&smid=ATVPDKIKX0DER'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"}

page = requests.get(URL, headers=headers)    
soup = BeautifulSoup(page.content, "html.parser")
soup.prettify()

#print(soup)

title = soup.find_all('span', {'id':'productTitle'})                        

print(title, len(title))   

The current output is:

[] 0
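
An empty list means the HTML that was parsed contains no `<span id="productTitle">`. The sketch below is one way to check for that case explicitly. It assumes, without confirmation from this thread, that the response received by the script may differ from the page shown in a browser. The helper name `extract_title` and both HTML snippets are hypothetical and only serve as illustration.

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the stripped text of <span id="productTitle">, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("span", id="productTitle")
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical HTML snippets standing in for two possible responses.
product_page = '<html><body><span id="productTitle">  Example Headphones  </span></body></html>'
other_page = "<html><body><p>No product markup here</p></body></html>"

print(extract_title(product_page))  # Example Headphones
print(extract_title(other_page))    # None
```

Calling the helper on `page.text` and also printing `page.status_code` should show whether the element is missing from the response itself rather than being lost during parsing.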

2 Answers:

Answer 0 (score: 1)

import requests
from bs4 import BeautifulSoup

# Fetch the product page (no custom headers are sent here)
r = requests.get("https://www.amazon.com/Sony-Noise-Cancelling-Headphones-WH1000XM3/dp/B07G4MNFS1/ref=sxin_0_ac_d_rm?ac_md=0-0-c29ueQ==-ac_d_rm&keywords=sony&pd_rd_i=B07G4MNFS1&pd_rd_r=3e6d5325-8ee4-4ba8-a84f-1b7cf2ce98bf&pd_rd_w=BVSFq&pd_rd_wg=I0LMZ&pf_rd_p=e2f20af2-9651-42af-9a45-89425d5bae34&pf_rd_r=VGT25BXXZNDE3B61A994&psc=1&qid=1577253649&smid=ATVPDKIKX0DER")
soup = BeautifulSoup(r.text, 'html.parser')

# find_all is the current name for the older findAll alias
for item in soup.find_all("span", {'id': 'productTitle'}):
    # strip=True removes the whitespace surrounding the title text
    print(item.get_text(strip=True))

Output:

Sony Noise Cancelling Headphones WH1000XM3: Wireless Bluetooth Over the Ear Headphones with Mic and Alexa voice control - Industry Leading Active Noise Cancellation - Black
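
The URL in this answer contains `==` where the question's URL contains `%3D%3D`. These are the same value, once percent-decoded, so the two requests target the same address. A quick check with the standard library, using only the `ac_md` parameter value copied from the two URLs (the variable names are illustrative):

```python
from urllib.parse import unquote

# Value of the ac_md query parameter as written in the question and in this answer.
question_value = "0-0-c29ueQ%3D%3D-ac_d_rm"
answer_value = "0-0-c29ueQ==-ac_d_rm"

# %3D is the percent-encoded form of "=".
decoded = unquote(question_value)
print(decoded == answer_value)  # True
```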


Answer 1 (score: 1)

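I spent the last two hours trying to scrape this title with BeautifulSoup. I tried scraping other elements on the page, without success. I also tried writing the raw content to a file, but that failed because of strange characters.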
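
For anyone who wants to inspect the raw response, here is a minimal sketch of dumping it to disk. It assumes the failure was an encoding error raised while writing text with the platform's default codec, which this answer does not confirm. The helper name `save_raw_html`, the file name, and the sample bytes are hypothetical; in the question's code the bytes would come from `page.content`.

```python
import os
import tempfile


def save_raw_html(content, path):
    """Write response bytes to disk unchanged; binary mode skips any text encoding step."""
    with open(path, "wb") as f:
        f.write(content)


# Hypothetical bytes standing in for page.content, including non-ASCII characters.
sample = "Sony\u00ae WH1000XM3 \u2013 caf\u00e9".encode("utf-8")
dump_path = os.path.join(tempfile.gettempdir(), "page_dump.html")
save_raw_html(sample, dump_path)

# Read the file back to confirm the bytes were stored exactly as received.
with open(dump_path, "rb") as f:
    restored = f.read()
print(restored == sample)  # True
```

Opening the dumped file in a browser or editor then shows what the server actually sent.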

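I tried Ahmed's answer and still got nothing. I tried other solutions I found online, with the same result. For the life of me, I can't figure out how to get this working with BeautifulSoup.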

I know you use Selenium, so here is a Selenium solution.

from selenium import webdriver
from selenium.webdriver.common.by import By

bot = webdriver.Chrome()
bot.get("https://www.amazon.com/Sony-Noise-Cancelling-Headphones-WH1000XM3/dp/B07G4MNFS1/ref=sxin_0_ac_d_rm?ac_md=0-0-c29ueQ==-ac_d_rm&keywords=sony&pd_rd_i=B07G4MNFS1&pd_rd_r=3e6d5325-8ee4-4ba8-a84f-1b7cf2ce98bf&pd_rd_w=BVSFq&pd_rd_wg=I0LMZ&pf_rd_p=e2f20af2-9651-42af-9a45-89425d5bae34&pf_rd_r=VGT25BXXZNDE3B61A994&psc=1&qid=1577253649&smid=ATVPDKIKX0DER")
# find_element_by_id was removed in newer Selenium 4 releases; find_element(By.ID, ...) replaces it
title = bot.find_element(By.ID, 'productTitle').text
print(title)
bot.quit()  # close the browser and end the WebDriver session