I'm editing a Python script that fetches images from a web page (it requires a private login, so there's no point in me posting the link). It uses the BeautifulSoup library; the original script is here.
What I want to do is customize this script so it fetches a single image whose HTML tag has the id attribute id="fimage"
. It has no class. Here is the code:
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen
from urllib.request import urlretrieve

# use this image scraper from the location that
# you want to save scraped images to
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    images = [img for img in soup.find(id="fimage")]
    print(images)
    print(str(len(images)) + " images found.")
    # print 'Downloading images to current working directory.'
    # compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename = each.split('/')[-1]
        urlretrieve(each, filename)
    return image_links

get_images('http://myurl');

# a standard call looks like this
# get_images('http://www.wookmark.com')
For some reason, this doesn't seem to work. When run on the command line, it produces the output:
[]
0 images found.
UPDATE
OK, so I've changed the code, and now the script seems to find the image I'm trying to download, but it throws another error when run and can't download it.
Here is the updated code:
from bs4 import BeautifulSoup
from urllib import request
import urllib.parse
import urllib.error
from urllib.request import urlopen

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    image = soup.find(id="logo", src=True)
    if image is None:
        print('No images found.')
        return
    image_link = image['src']
    filename = image_link.split('/')[-1]
    request.urlretrieve(filename)
    return image_link

try:
    get_images('https://pypi.python.org/pypi/ClientForm/0.2.10');
except ValueError as e:
    print("File could not be retrieved.", e)
else:
    print("It worked!")

# a standard call looks like this
# get_images('http://www.wookmark.com')
When run on the command line, the output is:
File could not be retrieved. unknown url type: 'python-logo.png'
Answer 0 (score: 1)
soup.find(id="fimage")
returns one result, not a list. You are trying to loop over that one element, which means it'll try to list the child nodes, and there are none.
Simply adjust your code to take into account that you have just the one result; remove all the looping:
image = soup.find(id="fimage", src=True)
if image is None:
    print('No matching image found')
    return

image_link = image['src']
filename = image_link.split('/')[-1]
urlretrieve(image_link, filename)
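A quick illustration of that behaviour (a hypothetical snippet, not from the original post), showing the difference between find() and find_all() and why looping over a single img tag yields nothing:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><img id="fimage" src="/pics/a.png"/></div>', 'html.parser')
print(soup.find(id="fimage"))         # the single <img> tag itself (or None if absent)
print(soup.find_all(id="fimage"))     # a one-element list containing that tag
print([child for child in soup.find(id="fimage")])  # [] -- an <img> tag has no child nodes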
I also improved the search a little; by adding src=True, the tag is only matched if it has a src attribute.
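For completeness, a minimal sketch of the whole function with that change, under the assumption that the scraped src may be a relative path (the "python-logo.png" error in the update suggests it is), so it is resolved with urllib.parse.urljoin before being handed to urlretrieve together with a local filename:

from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup

def get_image(url):
    # parse the page and look for the one tag with id="fimage" and a src attribute
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    image = soup.find(id="fimage", src=True)
    if image is None:
        print('No matching image found')
        return
    # the src may be relative, so resolve it against the page URL first
    image_link = urljoin(url, image['src'])
    filename = image_link.split('/')[-1]
    # urlretrieve expects the full URL first and the local filename second
    urlretrieve(image_link, filename)
    return image_link

Passing only the bare filename, as in request.urlretrieve(filename), is what raises "unknown url type": urllib cannot find a scheme such as http:// in a plain file name.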