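I am trying to write a Python script that downloads the daily-updated image from this site:
https://apod.nasa.gov/apod/astropix.html
I tried to follow the top answer to this post: How to extract and download all images from a website using beautifulSoup?
So this is what my code currently looks like: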
import re
import requests
from bs4 import BeautifulSoup
site = 'https://apod.nasa.gov/apod/astropix.html'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site, url)
        response = requests.get(url)
        f.write(response.content)
However, when I run the program, I get this error:
Traceback on line 17
    with open(filename.group(1), 'wb') as f:
AttributeError: 'NoneType' object has no attribute 'group'
So is there something wrong with my regex?
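Answer 0 (score: 1)
The regex group() you are looking for is 0 (the whole match), not 1. It contains the image path. Also, when the image source path is relative, the URL is not built correctly. I used the built-in urllib module to parse the site URL: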
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
site = 'https://apod.nasa.gov/apod/astropix.html'
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'([\w_-]+[.](jpg|gif|png))$', url)
    # drop the trailing run of digits (4 or more) from the file name
    filename = re.sub(r'\d{4,}\.', '.', filename.group(0))
    with open(filename, 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative;
            # if it is, rebuild the URL from the scheme and
            # hostname of the site variable.
            hostname = urlparse(site).hostname
            scheme = urlparse(site).scheme
            url = '{}://{}/{}'.format(scheme, hostname, url)
        # for the full-resolution image, the last four digits need to be stripped
        url = re.sub(r'\d{4,}\.', '.', url)
        print('Fetching image from {} to {}'.format(url, filename))
        response = requests.get(url)
        f.write(response.content)
Output:
Fetching image from https://apod.nasa.gov/image/1807/FermiFinals.jpg to FermiFinals.jpg
and the image is saved as FermiFinals.jpg
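As a side note on the relative-path handling discussed above, here is a minimal sketch of an alternative that uses `urllib.parse.urljoin` from the standard library, which resolves a relative `src` against the page URL the same way a browser would. The helper names `resolve_image_url` and `filename_from_url` are made up for illustration, and the relative path `image/1807/FermiFinals1024.jpg` is an assumed example value for `src`, not something taken from the page itself.

```python
import posixpath
from urllib.parse import urljoin, urlparse


def resolve_image_url(page_url, src):
    """Resolve an <img src> value against the URL of the page it came from.

    Absolute URLs are returned unchanged; relative ones are resolved
    against the directory of page_url.
    """
    return urljoin(page_url, src)


def filename_from_url(url):
    """Return the last path component of a URL, ignoring any query string."""
    return posixpath.basename(urlparse(url).path)


page = 'https://apod.nasa.gov/apod/astropix.html'
# Assumed example of a relative src attribute (hypothetical value)
full = resolve_image_url(page, 'image/1807/FermiFinals1024.jpg')
print(full)  # https://apod.nasa.gov/apod/image/1807/FermiFinals1024.jpg
print(filename_from_url(full))  # FermiFinals1024.jpg
```

Because `urljoin` drops the last segment of the page URL (`astropix.html`) before appending the relative path, no manual string replacement or hostname formatting is needed in this variant.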
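Answer 1 (score: 1)
I think the problem is the site variable. When all is said and done, the code tries to append the image path to site, which is https://apod.nasa.gov/apod/astropix.html. If you simply remove astropix.html, it works fine. What I have below is just a small modification of what you had; copy/paste it and ship it!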
import re
import requests
from bs4 import BeautifulSoup
site = "https://apod.nasa.gov/apod/astropix.html"
site_path_only = site.replace("astropix.html","")
response = requests.get(site)
soup = BeautifulSoup(response.text, 'html.parser')
img_tags = soup.find_all('img')
urls = [img['src'] for img in img_tags]
for url in urls:
    filename = re.search(r'/([\w_-]+[.](jpg|gif|png))$', url)
    with open(filename.group(1), 'wb') as f:
        if 'http' not in url:
            # sometimes an image source can be relative
            # if it is provide the base url which also happens
            # to be the site variable atm.
            url = '{}{}'.format(site_path_only, url)
        response = requests.get(url)
        f.write(response.content)
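Note that if it does download an image but the file is reported as corrupted and is only about 1k in size, you are probably getting a 404 for some reason. Just open the "image" in Notepad and read the HTML that was returned.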
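One way to catch that case before anything is written to disk is to look at the status code and the `Content-Type` header of the response. The sketch below is only an illustration: `is_image_response` is a hypothetical helper, and the `SimpleNamespace` objects are offline stand-ins that mimic the `status_code`, `headers` and `content` attributes of a `requests.Response`.

```python
from types import SimpleNamespace


def is_image_response(response):
    """Return True if the response is a 200 with an image/* Content-Type."""
    content_type = response.headers.get('Content-Type', '')
    return response.status_code == 200 and content_type.startswith('image/')


# Stand-ins for requests.Response objects so the sketch runs without network access
ok = SimpleNamespace(status_code=200,
                     headers={'Content-Type': 'image/jpeg'},
                     content=b'\xff\xd8\xff')
missing = SimpleNamespace(status_code=404,
                          headers={'Content-Type': 'text/html'},
                          content=b'<html>Not Found</html>')

print(is_image_response(ok))       # True
print(is_image_response(missing))  # False
```

In the download loop, the same check could be applied to the object returned by `requests.get(url)`, skipping the `f.write(...)` call when it returns False.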