How to get an image from a web page

Date: 2014-10-13 16:22:41

Tags: python image github web-scraping beautifulsoup

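I'm editing a Python script that fetches images from a web page (the page requires a private login, so there is no point in my posting the link). It uses the BeautifulSoup library; the original script is here.

What I want to do is customize this script to fetch a single image, whose HTML tag has the attribute id="fimage". It has no class attribute. Here is the code: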

from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen

# use this image scraper from the location that 
#you want to save scraped images to

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    #this makes a list of bs4 element tags
    images = [img for img in soup.find(id="fimage")]
    print (images)
    print (str(len(images)) + " images found.")
    # print 'Downloading images to current working directory.'
    #compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename=each.split('/')[-1]
        urlretrieve(each, filename)
    return image_links


get_images('http://myurl');


#a standard call looks like this
#get_images('http://www.wookmark.com')

For some reason this doesn't seem to work. When run from the command line, it produces this output:

[]
0 images found.

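Update

OK, so I've changed the code, and the script now seems to find the image I'm trying to download, but it throws a different error at runtime and the download fails.

Here is the updated code: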

from bs4 import BeautifulSoup
from urllib import request
import urllib.parse
import urllib.error
from urllib.request import urlopen

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    #this makes a list of bs4 element tags

    image = soup.find(id="logo", src=True)
    if image is None:
        print('No images found.')
        return

    image_link = image['src']
    filename = image_link.split('/')[-1]
    request.urlretrieve(filename)
    return image_link
try:    
    get_images('https://pypi.python.org/pypi/ClientForm/0.2.10');
except ValueError as e: 
    print("File could not be retrieved.", e)
else:
    print("It worked!")

#a standard call looks like this
#get_images('http://www.wookmark.com')

When run from the command line, the output is:

File could not be retrieved. unknown url type: 'python-logo.png'

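1 answer:

Answer 0 (score: 1)

soup.find(id="fimage") returns a single result, not a list. You are looping over that one element, which means iterating over its child nodes, and an image tag has none.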
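To make the difference concrete, here is a minimal sketch. The HTML string is hypothetical markup standing in for the private page, and it assumes the third-party bs4 package is installed. It contrasts find(), iteration over the returned tag, and find_all():

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the private page.
html = '<div><img id="fimage" src="/images/photo.jpg"><img src="/other.png"></div>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find(id="fimage")          # a single Tag (or None), not a list
children = list(tag)                  # iterating a Tag walks its child nodes
matches = soup.find_all(id="fimage")  # find_all returns a list-like ResultSet

print(type(tag).__name__)
print(children)
print(len(matches), matches[0]["src"])
```

The list built from iterating the tag is empty, which is why the question's list comprehension reported 0 images even though the tag itself was found.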

Just adjust your code to take into account that you only have one result; remove all the looping:

image = soup.find(id="fimage", src=True)
if image is None:
    print('No matching image found')
    return

image_link = image['src']
filename = image_link.split('/')[-1]
urlretrieve(image_link, filename)

I refined the search a little; with src=True added, the tag is only matched if it has a src attribute.
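As for the "unknown url type: 'python-logo.png'" error in the update: the updated script passes only the bare file name to urlretrieve, which expects the URL as its first argument and the local file name as its second. The src value may also be relative to the page. The following is a sketch of one way to handle both, not the original script's method; image_target is a hypothetical helper and the example.com URLs are made up for illustration:

```python
from urllib.parse import urljoin, urlsplit
import posixpath


def image_target(page_url, src):
    """Return (absolute_url, filename) for an <img> src found on page_url."""
    # urljoin leaves absolute URLs untouched and resolves relative ones
    # against the page they were found on.
    absolute_url = urljoin(page_url, src)
    # Take the file name from the URL path only, ignoring any query string.
    filename = posixpath.basename(urlsplit(absolute_url).path)
    return absolute_url, filename


# Hypothetical values for illustration; not taken from the real page.
page = "https://example.com/gallery/index.html"
print(image_target(page, "/static/images/python-logo.png"))
print(image_target(page, "thumbs/cat.png?size=2"))

# Downloading would then pass the URL first and the local file name second:
# from urllib.request import urlretrieve
# url, name = image_target(page, image["src"])
# urlretrieve(url, name)
```

Inside get_images, page_url would be the url argument and src would be image['src'].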