Scraping a different image from a website every day

Time: 2018-07-23 19:14:39

Tags: python

I am trying to write a Python script that downloads the image that is updated daily on this site:

https://apod.nasa.gov/apod/astropix.html

I tried to follow the top answer on this post: How to extract and download all images from a website using beautifulSoup?

So, this is what my code currently looks like:

import re
import requests
from bs4 import BeautifulSoup

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]


for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative 
            # if it is provide the base url which also happens 
            # to be the site variable atm. 
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)

However, when I run the program, I get this error:

Traceback on line 17:
    with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'

So is there something wrong with my regex?
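
A quick way to see why re.search returns None (just a diagnostic sketch, not part of the original script) is to print each img src and whether the pattern matches, before any file is opened:

import re
import requests
from bs4 import BeautifulSoup

site = 'https://apod.nasa.gov/apod/astropix.html'
soup = BeautifulSoup(requests.get(site).text, 'html.parser')

pattern = re.compile(r'/([\w_-]+[.](jpg|gif|png))$')

for img in soup.find_all('img'):
    src = img.get('src', '')
    match = pattern.search(src)
    # Any src whose tail does not look like '/name.jpg' (for example a bare
    # filename with no slash, or an unexpected extension) prints 'no match'
    # here, and that same src is what triggers the AttributeError above.
    print(src, '->', match.group(1) if match else 'no match')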

2 Answers:

Answer 0 (score: 1):

The regex group() you are looking for is 0, not 1; it contains the image path. Also, the URL was not being built correctly when the image source path is relative. I used the built-in urllib module to parse the site URL:

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

site = 'https://apod.nasa.gov/apod/astropix.html'

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))

    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)

        # for full resolution image the last four digits needs to be striped
        url = re.sub(r'\d{4,}\.', '.', url)

        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)

Output:

Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg

and the image is saved as FermiFinals.jpg.
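
A more compact way to resolve relative image paths (not part of this answer, just a sketch) is urllib.parse.urljoin, which resolves a relative src against the page URL and leaves absolute URLs untouched:

from urllib.parse import urljoin

site = 'https://apod.nasa.gov/apod/astropix.html'

# A relative src is resolved against the directory of the page URL.
print(urljoin(site, 'image/1807/FermiFinals.jpg'))
# -> https://apod.nasa.gov/apod/image/1807/FermiFinals.jpg

# An already absolute URL is returned unchanged.
print(urljoin(site, 'https://example.com/pic.png'))
# -> https://example.com/pic.png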

Answer 1 (score: 1):

I think the problem is the site variable. When everything is put together, it tries to append the image path to site, which is https://apod.nasa.gov/apod/astropix.html. If you simply strip off astropix.html, it works fine. What I have below is just a small modification of what you already have; copy/paste it and send it!

import re
import requests
from bs4 import BeautifulSoup

site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html","")

response = requests.get(site)

soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')

urls = [img['src'] for img in img_tags]

for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)

Note that if it downloads the image but the file is reported as corrupt and is only about 1 KB in size, you are probably getting a 404 for some reason. Just open the "image" in Notepad and read the HTML it returned.
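
One way to guard against that (a small sketch, not from the original answer) is to check the HTTP status before writing anything, so an error page never gets saved under a .jpg name:

import requests

# Hypothetical image URL, used for illustration only.
url = 'https://apod.nasa.gov/apod/image/1807/FermiFinals.jpg'

response = requests.get(url)
# raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
# so a 404 HTML page is never written to disk as a broken "image".
response.raise_for_status()

with open('FermiFinals.jpg', 'wb') as f:
    f.write(response.content)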