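I wrote a simple image scraper script that works in most cases. I came across a site with some nice jpg wallpapers, and I want to grab the links to them. The script works, but it also prints unwanted base64 data image links. How can I exclude these base64 links?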
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.find_all('img'):
    image = link.get('src')
    print(image)
Output:
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/two-gentlemen-in-car.jpg
data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==
Update: Thanks for your help. The finished code, which downloads all the images, looks like this. Cheers :)
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.select('img[src$=".jpg"]'):
    image = link['src']
    # src is already an absolute URL, so its last path segment is the file name
    image_name = image.split('/')[-1]
    print('Downloading: {}'.format(image_name))
    r2 = requests.get(image)
    with open(image_name, 'wb') as f:
        f.write(r2.content)
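Splitting the URL on '/' works for the links on this page, but it would keep a query string or fragment in the file name if a link carried one. A minimal sketch of a more defensive variant, using only the standard library; `filename_from_url` and its `default` parameter are hypothetical names introduced for illustration, and the `?ver=2` suffix is an invented example:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image.jpg"):
    """Return the last path segment of url, ignoring query string and fragment."""
    path = urlsplit(url).path               # drops "?..." and "#..."
    name = posixpath.basename(unquote(path))  # decode %20 etc., take last segment
    return name or default                  # fall back when the path ends in "/"


url = "https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg"
print(filename_from_url(url))
print(filename_from_url(url + "?ver=2"))  # hypothetical query string
```

In the download loop, `image_name = filename_from_url(image)` could replace the split-based line.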
Answer 0 (score: 1)
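Give this a try. It returns the desired results. I used .select() here instead of .find_all(); the CSS selector img[src$=".jpg"] matches only img tags whose src ends with .jpg, so the data: links are skipped.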
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(r.content, 'lxml')
for link in soup.select('img[src$=".jpg"]'):
    print(link['src'])
Or, if you prefer to use .find_all():
for link in soup.find_all('img'):
    # .get() avoids a KeyError on <img> tags that have no src attribute
    src = link.get('src', '')
    if ".jpg" in src:
        print(src)
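Since the unwanted entries are all data: URIs, another option is to filter on the URL scheme instead of the file extension, which would also keep .png or extensionless image links. A minimal sketch, assuming only http(s) links are wanted; `is_remote_image` is a hypothetical helper name, and the sample values are copied from the output in the question:

```python
from urllib.parse import urlsplit


def is_remote_image(src):
    """Return True for http(s) URLs; False for data: URIs, empty or missing values."""
    if not src:
        return False
    return urlsplit(src).scheme in ("http", "https")


# Sample values copied from the output shown in the question.
sources = [
    "https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/cloudy-ubuntu-mate.jpg",
    "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==",
    None,  # what link.get('src') returns for an <img> without a src attribute
]

kept = [src for src in sources if is_remote_image(src)]
print(kept)
```

Relative or protocol-relative links have an empty scheme and would be rejected by this check; they would need to be resolved against the page URL (for example with `urllib.parse.urljoin`) first.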
Answer 1 (score: 0)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

htmldata = urlopen('https://www.hongkiat.com/blog/60-most-execellent-ubuntu-wallpapers/')
soup = BeautifulSoup(htmldata, 'html.parser')
# Keep only <img> tags whose src contains jpeg, png or jpg
result = soup.find_all('img', src=re.compile(r".*?(?=jpeg|png|jpg)"))
for img in result:
    print(img['src'])
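The lookahead pattern accepts any src that contains jpeg, png or jpg anywhere in the string. That happens to exclude the GIF placeholders on this page, but it would let an inline data:image/png link through. The sketch below compares it with a pattern anchored on the file extension, using plain `re` so it runs without fetching the page; the `anchored` pattern and the PNG data URI sample are assumptions added for illustration:

```python
import re

lookahead = re.compile(r".*?(?=jpeg|png|jpg)")             # pattern from the answer
anchored = re.compile(r"\.(?:jpe?g|png)$", re.IGNORECASE)  # assumed stricter alternative

samples = [
    "https://assets.hongkiat.com/uploads/60-most-execellent-ubuntu-wallpapers/ubuntu-feeling.jpg",
    "data:image/png;base64,AAAA",  # hypothetical inline PNG
    "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw==",
]

# Beautiful Soup tests a compiled pattern against the attribute with .search(),
# so .search() is used here as well.
for src in samples:
    print(bool(lookahead.search(src)), bool(anchored.search(src)), src[:40])
```

Passing `src=anchored` to `find_all` would then keep only links that end in .jpg, .jpeg or .png.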