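I want to get a captcha image from the browser. I have the URL of the image, but the image changes every time it is refreshed (the URL stays the same).
Is there any way to get the picture from the browser (like the "Save picture as" button)?
On the other hand, I think it should be workable:
Link to the dynamic captcha - link
The problem was solved with a screenshot: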
browser.save_screenshot('screenshot.png')
img = browser.find_element_by_xpath('//*[@id="cryptogram"]')
loc = img.location
image = cv.LoadImage('screenshot.png', True)
out = cv.CreateImage((150,60), image.depth, 3)
cv.SetImageROI(image, (loc['x'],loc['y'],150,60))
cv.Resize(image, out)
cv.SaveImage('out.jpg', out)
Thanks
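The cropping step in the question hard-codes the 150x60 size. A minimal sketch of how the crop rectangle could be computed instead, assuming the element's `location` and `size` dictionaries (both are WebElement properties in Selenium). The helper name `crop_box` and the `scale` parameter (a guess at the device pixel ratio on high-DPI screens) are illustrative and not part of the original post.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    location: dict with 'x' and 'y', as returned by WebElement.location
    size: dict with 'width' and 'height', as returned by WebElement.size
    scale: assumed device pixel ratio (1.0 on a standard display)
    """
    left = int(round(location['x'] * scale))
    top = int(round(location['y'] * scale))
    right = int(round((location['x'] + size['width']) * scale))
    bottom = int(round((location['y'] + size['height']) * scale))
    return (left, top, right, bottom)


# Made-up coordinates standing in for img.location and img.size
box = crop_box({'x': 20, 'y': 35}, {'width': 150, 'height': 60})
print(box)
```

The tuple is in the (left, upper, right, lower) order that Pillow's Image.crop expects, in case Pillow is used in place of the legacy cv module.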
Answer 0 (score: 33)
Here is a complete example (using Google's reCAPTCHA as the target):
import urllib
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.google.com/recaptcha/demo/recaptcha')
# get the image source
img = driver.find_element_by_xpath('//div[@id="recaptcha_image"]/img')
src = img.get_attribute('src')
# download the image
urllib.urlretrieve(src, "captcha.png")
driver.close()
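The example above is written for Python 2, where urlretrieve lives directly in urllib. Under Python 3 the same function is urllib.request.urlretrieve. A minimal sketch of the download step, assuming Python 3; the helper name `save_image` is hypothetical, and a local file:// URL stands in for the image's `src` attribute so the snippet runs without a browser.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def save_image(src, filename):
    """Download the resource at src to filename and return the path."""
    path, _headers = urlretrieve(src, filename)
    return path


# Stand-in for the image URL: a local file exposed as a file:// URL.
workdir = tempfile.mkdtemp()
source = Path(workdir) / "source.png"
source.write_bytes(b"\x89PNG\r\n\x1a\n")  # only the PNG signature, as placeholder data
target = os.path.join(workdir, "captcha.png")

save_image(source.as_uri(), target)
print(os.path.getsize(target))
```

With a real page, `src` would be the value returned by img.get_attribute('src').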
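Update:
The problem with dynamically generated images is that a new image is generated on every request. In that case you have a few options:
Take a screenshot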
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://moscowsg.megafon.ru/ps/scc/php/cryptographp.php?PHPSESSID=mfc540jkbeme81qjvh5t0v0bnjdr7oc6&ref=114&w=150')
driver.save_screenshot("screenshot.png")
driver.close()
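Simulate a right click + "Save as". See this thread for details.
Hope that helps.
Answer 1 (score: 1)
So, to keep this relevant, here is a 2020 solution using seleniumwire, a package that gives you access to the requests made by the browser. You can use it easily, as follows: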
from selenium.webdriver import ChromeOptions
from seleniumwire import webdriver
# Sometimes, selenium randomly crashed when using seleniumwire, these options fixed that.
# Probably has to do with how it proxies everything.
options = ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
driver = webdriver.Chrome(options=options)
driver.get("https://google.com")
for request in driver.requests:
    # request.path
    # request.method
    # request.headers
    # request.response is the response instance
    # request.response.body is the raw response body in bytes
    print(request.method, request.url)
# if you are using it for a ton of requests, make sure to clear them:
del driver.requests
Now, why would you need this? Well, for example for ReCaptcha bypassing, or for bypassing things like Incapsula. Use it at your own risk.
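For the captcha case, the interesting part is picking out the one response that carried the image the browser actually displayed. A minimal sketch, assuming only the `url`, `response` and `response.body` attributes that appear in this thread; the helper name `find_response_body` is hypothetical, and SimpleNamespace objects with made-up URLs stand in for driver.requests so it runs without a browser.

```python
from types import SimpleNamespace


def find_response_body(requests, url_part):
    """Return the body of the last captured response whose URL contains url_part.

    Returns None when no matching request has a response yet.
    """
    for request in reversed(list(requests)):
        if url_part in request.url and request.response is not None:
            return request.response.body
    return None


# Stand-ins for driver.requests (hypothetical URLs and bodies).
captured = [
    SimpleNamespace(url="https://example.com/index.html",
                    response=SimpleNamespace(body=b"<html></html>")),
    SimpleNamespace(url="https://example.com/cryptographp.php?w=150",
                    response=SimpleNamespace(body=b"image-bytes")),
    SimpleNamespace(url="https://example.com/pending.png", response=None),
]

body = find_response_body(captured, "cryptographp.php")
print(body)
```

Searching from the newest request backwards means a refreshed captcha is preferred over an older one with the same URL.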
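Answer 2 (score: 0)
You can save a screenshot of the whole page and then crop the image out, but you can also use one of the webdriver's "find" methods to locate the image you want to save and write out its "screenshot_as_png" property, as shown below: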
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.webpagetest.org/')
with open('filename.png', 'wb') as file:
    file.write(driver.find_element_by_xpath('/html/body/div[1]/div[5]/div[2]/table[1]/tbody/tr/td[1]/a/div').screenshot_as_png)
Sometimes this can go wrong because of scrolling, but depending on your needs it is a good way to get the image.
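Since scrolling can make the capture come out wrong, it may be worth checking the dimensions of the returned bytes before trusting them. A minimal sketch that reads the width and height from the PNG header using only the standard library; the helper name `png_size` is hypothetical, and a hand-built header with made-up dimensions stands in for the value of screenshot_as_png.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) read from the IHDR chunk of PNG bytes."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    # Width and height are two big-endian 32-bit integers right after 'IHDR'.
    return struct.unpack(">II", data[16:24])


# Hand-built header standing in for element.screenshot_as_png
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 150, 60)
print(png_size(header))
```

The result could then be compared with the element's size property, keeping in mind that a high-DPI display may scale the screenshot.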
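Answer 3 (score: 0)
The problem with save_screenshot is that we cannot save the image at its original quality, nor can we recover the image's alpha channel. So I propose another solution. Here is a complete example using the selenium-wire library suggested by @codam_hsmits. Images can be downloaded through ChromeDriver.
I defined the following function to parse each request and, when necessary, save the response body to a file.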
from seleniumwire import webdriver  # Import from seleniumwire
from urllib.parse import urlparse
import os
from mimetypes import guess_extension
import time
import datetime
def download_assets(requests,
                    asset_dir="temp",
                    default_fname="unnamed",
                    skip_domains=["facebook", "google", "yahoo", "agkn", "2mdn"],
                    exts=[".png", ".jpeg", ".jpg", ".svg", ".gif", ".pdf", ".bmp", ".webp", ".ico"],
                    append_ext=False):
    asset_list = {}
    for req_idx, request in enumerate(requests):
        # request.headers
        # request.response.body is the raw response body in bytes
        if request is None or request.response is None or request.response.headers is None or 'Content-Type' not in request.response.headers:
            continue
        ext = guess_extension(request.response.headers['Content-Type'].split(';')[0].strip())
        if ext is None or ext == "" or ext not in exts:
            # Don't know the file extension, or not in the whitelist
            continue
        parsed_url = urlparse(request.url)
        skip = False
        for d in skip_domains:
            if d in parsed_url.netloc:
                skip = True
                break
        if skip:
            continue
        frelpath = parsed_url.path.strip()
        if frelpath == "":
            timestamp = str(datetime.datetime.now().replace(microsecond=0).isoformat())
            frelpath = f"{default_fname}_{req_idx}_{timestamp}{ext}"
        elif frelpath.endswith("\\") or frelpath.endswith("/"):
            timestamp = str(datetime.datetime.now().replace(microsecond=0).isoformat())
            frelpath = frelpath + f"{default_fname}_{req_idx}_{timestamp}{ext}"
        elif append_ext and not frelpath.endswith(ext):
            frelpath = frelpath + f"_{default_fname}{ext}"  # Missing file extension but may not be a problem
        if frelpath.startswith("\\") or frelpath.startswith("/"):
            frelpath = frelpath[1:]
        fpath = os.path.join(asset_dir, parsed_url.netloc, frelpath)
        if os.path.isfile(fpath):
            continue
        os.makedirs(os.path.dirname(fpath), exist_ok=True)
        print(f"Downloading {request.url} to {fpath}")
        asset_list[fpath] = request.url
        try:
            with open(fpath, "wb") as file:
                file.write(request.response.body)
        except Exception:
            print(f"Cannot download {request.url} to {fpath}")
    return asset_list
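One caveat with the function above: isoformat() produces timestamps such as 2020-05-01T12:30:00, and the colon is not allowed in Windows file names. A minimal sketch of a colon-free alternative, assuming a compact strftime pattern is acceptable for the generated names; the helper name `filename_timestamp` is hypothetical and not part of the original answer.

```python
import datetime


def filename_timestamp(now=None):
    """Return a timestamp such as 20200501T123000 that is safe in file names."""
    if now is None:
        now = datetime.datetime.now()
    return now.strftime("%Y%m%dT%H%M%S")


# Fixed date used only to make the output reproducible
stamp = filename_timestamp(datetime.datetime(2020, 5, 1, 12, 30, 0))
print(stamp)
```

The helper could replace the two isoformat() calls if the files need to be written on Windows.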
Let's download some images from the Google home page into the temp folder.
# Create a new instance of the Chrome/Firefox driver
driver = webdriver.Chrome()
# Go to the Google home page
driver.get('https://www.google.com')
# Download content to temp folder
asset_dir = "temp"
try:
    while True:
        # Browse the web in the opened window; new images are collected once per second
        time.sleep(1)
        download_assets(driver.requests, asset_dir=asset_dir)
except KeyboardInterrupt:
    # Stop collecting with Ctrl+C
    pass
finally:
    driver.close()
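Note that it cannot tell which images are visible on the page and which are hidden in the background, so the user should actively click buttons or links to trigger new download requests.
Answer 4 (score: 0)
Here it is.
Extract the width and height of the image with BeautifulSoup.
Set the correct current window size with driver.set_window_size, and take a screenshot with driver.save_screenshot.
from bs4 import BeautifulSoup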
from selenium import webdriver
import os
from urllib.parse import urlparse
url = 'https://image.rakuten.co.jp/azu-kobe/cabinet/hair1/hb-30-pp1.jpg'
filename = os.path.basename(urlparse(url).path)
filename_png = os.path.splitext(filename)[0] + '.png' # change file extension to .png
opts = webdriver.ChromeOptions()
opts.headless = True
driver = webdriver.Chrome(options=opts)
driver.get(url)
# Get the width and height of the image
soup = BeautifulSoup(driver.page_source, 'lxml')
width = soup.find('img')['width']
height = soup.find('img')['height']
driver.set_window_size(int(width), int(height))  # attribute values are strings, so convert them to int
driver.save_screenshot(filename_png)
It also works for WebP, Google's image format.
Reference: Download Google's WebP Images via Take Screenshots with Selenium WebDriver.
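The size-extraction step in this answer can also be done with the standard library if BeautifulSoup and lxml are not installed. A minimal sketch using html.parser; the class and function names are hypothetical, and the page source and its 640x480 dimensions are made up to resemble what a browser builds around a bare image. HTML attribute values arrive as strings, so they are converted to integers before use as pixel sizes.

```python
from html.parser import HTMLParser


class FirstImgSize(HTMLParser):
    """Record the width and height attributes of the first sized <img> tag."""

    def __init__(self):
        super().__init__()
        self.size = None

    def handle_starttag(self, tag, attrs):
        if tag == "img" and self.size is None:
            attributes = dict(attrs)
            if "width" in attributes and "height" in attributes:
                self.size = (int(attributes["width"]), int(attributes["height"]))


def first_img_size(html):
    """Return (width, height) as integers, or None if no sized <img> is found."""
    parser = FirstImgSize()
    parser.feed(html)
    return parser.size


# Hypothetical page source standing in for driver.page_source
page = '<html><body><img src="hb-30-pp1.jpg" width="640" height="480"></body></html>'
print(first_img_size(page))
```

The returned pair could be unpacked straight into driver.set_window_size.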