Question

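I am trying to use Beautiful Soup 4 to help me download images from Imgur, although I doubt the Imgur part is relevant. As an example, I am using this page: https://imgur.com/t/lenovo/mLwnorj

My code is as follows: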

import webbrowser, time, sys, requests, os, bs4      # Not all libraries are used in this code snippet
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")

res = requests.get("https://imgur.com/t/lenovo/mLwnorj")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")

imageElement = soup.findAll('img', {'class': 'post-image-placeholder'})
print(imageElement)

The HTML on the Imgur page contains a part that reads:

<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder" style="max-width: 100%; min-height: 546px;" original-title="">

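I found it by using the point-and-click tool in Inspect Element to select the first image element on the page.

The problem is that I expect imageElement to contain two items, one per image, but the print call shows []. I have also tried other forms of soup.findAll('img', {'class': 'post-image-placeholder'}), such as soup.findall("img[class='post-image-placeholder']"), but that made no difference.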

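Also, when I instead use

imageElement = soup.select("h1[class='post-title']")

as a test, the print call does return a match, which makes me wonder whether the problem has something to do with the tag: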

[<h1 class="post-title">Cable management increases performance. </h1>]

Thank you for your time and effort.

Answer 1

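The underlying problem seems to be that the actual <img ...> elements do not exist when the page is first loaded. I think the best solution is to take advantage of the Selenium WebDriver you already have available to fetch the images. Selenium lets the page render properly (JavaScript and all), and you can then find whatever elements you care about.

For example: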

import webbrowser, time, sys, requests, os, bs4      # Not all libraries are used in this code snippet
from selenium import webdriver
from selenium.webdriver.common.by import By

# For pretty debugging output
import pprint


browser = webdriver.Firefox()
browser.get("https://imgur.com/t/lenovo/mLwnorj")

# Give the page up to 10 seconds of a grace period to finish rendering
# before complaining about images not being found.
browser.implicitly_wait(10)

# Find elements via Selenium's search (Selenium 4 locator API)
selenium_image_elements = browser.find_elements(By.CSS_SELECTOR, 'img.post-image-placeholder')
pprint.pprint(selenium_image_elements)

# Use page source to attempt to find them with BeautifulSoup 4
soup = bs4.BeautifulSoup(browser.page_source, features="html.parser")

soup_image_elements = soup.findAll('img', {'class': 'post-image-placeholder'})
pprint.pprint(soup_image_elements)

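~~I can't say I have tested this code on my end,~~ but the general concept should work.

Update:

I went ahead and tested this, fixed a few errors in the code, and got the results I was hoping to see.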
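Since the goal in the question is to download the images, here is a minimal sketch of the next step: turning the matched <img> tags into absolute URLs. It assumes the rendered markup looks like the snippet quoted in the question, where src is protocol-relative (it starts with //). The helper name image_urls and the PAGE_URL constant are made up for illustration and are not part of any library.

```python
from urllib.parse import urljoin

import bs4

# Page the images were found on; used to resolve relative src values.
PAGE_URL = "https://imgur.com/t/lenovo/mLwnorj"


def image_urls(page_source, page_url=PAGE_URL):
    """Return absolute URLs for every img.post-image-placeholder in page_source.

    Hypothetical helper for illustration only.
    """
    soup = bs4.BeautifulSoup(page_source, features="html.parser")
    urls = []
    for img in soup.find_all("img", class_="post-image-placeholder"):
        src = img.get("src")
        if src:
            # "//i.imgur.com/..." has no scheme; urljoin borrows it from page_url.
            urls.append(urljoin(page_url, src))
    return urls


# The <img> tag quoted in the question, used here as stand-in page source.
sample = '<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder">'
print(image_urls(sample))  # ['https://i.imgur.com/JfLsH5y.jpg']
```

In the script above you would pass browser.page_source instead of sample. Each resulting URL could then be fetched with requests.get(url) and its .content written to a file.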

Answer 2

If the website inserts objects after the page has loaded, you need to use Selenium instead of requests.
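One way to check whether this applies is to compare what Beautiful Soup finds in the raw response with what it finds in the rendered page. The sketch below runs offline on two made-up HTML strings: STATIC_HTML stands in for res.text and RENDERED_HTML for browser.page_source. The post-image-container wrapper and the count_matches helper are assumptions for illustration; the real Imgur markup may differ.

```python
import bs4

# Hypothetical stand-in for requests.get(...).text: the <img> tag is absent
# because a script inserts it only after the page has loaded.
STATIC_HTML = """
<html><body>
<h1 class="post-title">Cable management increases performance. </h1>
<div class="post-image-container"></div>
</body></html>
"""

# Hypothetical stand-in for browser.page_source after JavaScript has run.
RENDERED_HTML = """
<html><body>
<h1 class="post-title">Cable management increases performance. </h1>
<div class="post-image-container">
<img alt="" src="//i.imgur.com/JfLsH5y.jpg" class="post-image-placeholder">
</div>
</body></html>
"""


def count_matches(html, tag, css_class):
    """Return how many <tag> elements carrying css_class appear in html."""
    soup = bs4.BeautifulSoup(html, features="html.parser")
    return len(soup.find_all(tag, class_=css_class))


for label, html in (("static", STATIC_HTML), ("rendered", RENDERED_HTML)):
    print(label,
          count_matches(html, "h1", "post-title"),
          count_matches(html, "img", "post-image-placeholder"))
```

If the title matches in both but the image only matches in the rendered source, as it does here, the element is being added by script. That is consistent with the question, where the h1 lookup succeeded and the img lookup returned [].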

Beautiful Soup 4 findall() does not match elements from an <img> tag