I'm editing a Python script that fetches images from a web page (it requires a private login, so there's no point in me posting the link). It uses the BeautifulSoup library; the original script is here.
What I want to do is customize this script so it fetches a single image whose HTML tag has the id attribute id="fimage"
. It has no class. Here is the code:
from bs4 import BeautifulSoup
import urllib.request
import urllib.parse
import urllib.error
from urllib.request import urlopen
from urllib.request import urlretrieve

# use this image scraper from the location that
# you want to save scraped images to
def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    images = [img for img in soup.find(id="fimage")]
    print(images)
    print(str(len(images)) + " images found.")
    # print 'Downloading images to current working directory.'
    # compile our unicode list of image links
    image_links = [each.get('src') for each in images]
    for each in image_links:
        filename = each.split('/')[-1]
        urlretrieve(each, filename)
    return image_links

get_images('http://myurl');

# a standard call looks like this
# get_images('http://www.wookmark.com')
For some reason, this doesn't seem to work. When run on the command line, it produces the output:
[]
0 images found.
UPDATE
OK, so I've changed the code, and now the script seems to find the image I'm trying to download, but it throws another error when run and can't download it.
Here is the updated code:
from bs4 import BeautifulSoup
from urllib import request
import urllib.parse
import urllib.error
from urllib.request import urlopen

def make_soup(url):
    html = urlopen(url).read()
    return BeautifulSoup(html)

def get_images(url):
    soup = make_soup(url)
    # this makes a list of bs4 element tags
    image = soup.find(id="logo", src=True)
    if image is None:
        print('No images found.')
        return
    image_link = image['src']
    filename = image_link.split('/')[-1]
    request.urlretrieve(filename)
    return image_link

try:
    get_images('https://pypi.python.org/pypi/ClientForm/0.2.10');
except ValueError as e:
    print("File could not be retrieved.", e)
else:
    print("It worked!")

# a standard call looks like this
# get_images('http://www.wookmark.com')
When run on the command line, the output is:
File could not be retrieved. unknown url type: 'python-logo.png'
Answer 0 (score: 1)
soup.find(id="fimage")
returns one result, not a list. You are trying to loop over that one element, which means it'll try to list the child nodes, and there are none.
Simply adjust your code to take into account that you have just the one result; remove all the looping:
image = soup.find(id="fimage", src=True)
if image is None:
    print('No matching image found')
    return

image_link = image['src']
filename = image_link.split('/')[-1]
urlretrieve(image_link, filename)
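A quick illustration of that behaviour (a hypothetical snippet, not from the original post), showing the difference between find() and find_all() and why looping over a single img tag yields nothing:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><img id="fimage" src="/pics/a.png"/></div>', 'html.parser')
print(soup.find(id="fimage"))         # the single <img> tag itself (or None if absent)
print(soup.find_all(id="fimage"))     # a one-element list containing that tag
print([child for child in soup.find(id="fimage")])  # [] -- an <img> tag has no child nodes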
I also improved the search a little; by adding src=True, the tag is only matched if it has a src attribute.
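For completeness, a minimal sketch of the whole function with that change, under the assumption that the scraped src may be a relative path (the "python-logo.png" error in the update suggests it is), so it is resolved with urllib.parse.urljoin before being handed to urlretrieve together with a local filename:

from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve

from bs4 import BeautifulSoup

def get_image(url):
    # parse the page and look for the one tag with id="fimage" and a src attribute
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')
    image = soup.find(id="fimage", src=True)
    if image is None:
        print('No matching image found')
        return
    # the src may be relative, so resolve it against the page URL first
    image_link = urljoin(url, image['src'])
    filename = image_link.split('/')[-1]
    # urlretrieve expects the full URL first and the local filename second
    urlretrieve(image_link, filename)
    return image_link

Passing only the bare filename, as in request.urlretrieve(filename), is what raises "unknown url type": urllib cannot find a scheme such as http:// in a plain file name.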