Beautiful Soup Python findAll returns an empty list

Asked: 2020-10-30 04:25:07

Tags: python web-scraping beautifulsoup

I am trying to scrape the page of an Amazon Alexa skill: https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1

For now, I just want to get the skill name (PayPal), but for some reason the call returns an empty list. I have looked at the page with the browser's inspect-element tool and I know the name is there, so I'm not sure what is going wrong. My code is as follows:

# Video Compiler
cliplist = []
count = 1
for filename in os.listdir(location):
    if filename.endswith(".mp4"):
        cliplist.insert(0, VideoFileClip(f'{Path(location)}/{filename}'))
        print(f'Clip {count} Processed')
        count += 1

for clip in cliplist:
    clip.resize(height=1080)

2 Answers:

Answer 0 (score: 1)

The page content is loaded by JavaScript, so you cannot scrape it with BeautifulSoup alone. You have to use another module, such as selenium, to simulate JavaScript execution.

Here is an example:

from bs4 import BeautifulSoup as soup
from selenium import webdriver

url = 'YOUR URL'

# Launch a real browser so the page's JavaScript actually runs
driver = webdriver.Firefox()
driver.get(url)

# Hand the fully rendered HTML to BeautifulSoup
page = driver.page_source
page_soup = soup(page, 'html.parser')
driver.quit()

containers = page_soup.find_all("h1", {"class": "a2s-title-content"})
print(containers)
print(len(containers))
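For comparison, here is a minimal offline illustration (the HTML string is made up) of why find_all comes back empty on the raw response: before JavaScript runs, the tag you are looking for simply is not in the HTML yet.

```python
from bs4 import BeautifulSoup

# Raw HTML roughly as requests would see it BEFORE any JavaScript
# runs: the skill-title element is not present yet.
html = "<html><body><div id='root'></div></body></html>"
page_soup = BeautifulSoup(html, "html.parser")

# find_all returns an empty list, not an error, when nothing
# matches -- exactly the symptom from the question.
containers = page_soup.find_all("h1", {"class": "a2s-title-content"})
print(containers)       # []
print(len(containers))  # 0
```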

You can also use chrome-driver or edge-driver, see here.

Answer 1 (score: 0)

Try setting the User-Agent and Accept-Language HTTP headers to prevent the server from sending you a captcha page:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
    'Accept-Language': 'en-US,en;q=0.5'
}

url = 'https://www.amazon.com/PayPal/dp/B075764QCX/ref=sr_1_1?dchild=1&keywords=paypal&qid=1604026451&s=digital-skills&sr=1-1'

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'lxml')
name = soup.find("h1", {"class" : "a2s-title-content"})
print(name.get_text(strip=True))

Prints:

PayPal
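If the request still misbehaves, the parsing step itself can be checked offline. A minimal sketch against a hypothetical HTML fragment (the class name is the one used above; the fragment stands in for the real response body so no network request is needed):

```python
from bs4 import BeautifulSoup

# Made-up fragment mimicking the relevant part of the page
html = '<h1 class="a2s-title-content">  PayPal  </h1>'
soup = BeautifulSoup(html, 'html.parser')

name = soup.find("h1", {"class": "a2s-title-content"})

# get_text(strip=True) trims the surrounding whitespace
print(name.get_text(strip=True))  # PayPal
```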