Python + Selenium Firefox webdriver - extracting images from a website

Date: 2018-03-02 23:59:54

Tags: python selenium webdriver

I am trying to extract images from a web page using Python 2.7 + Selenium (with Firefox) + Beautiful Soup.

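The page loads dynamically, so I am using Selenium to scrape it. Everything looks fine in the browser, but when I look at the HTML after all the images have loaded, I cannot find any links to the images. Any idea what is going on here?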

The site is https://flipp.com/flyers?postal_code=97035. From there, navigate to https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad to view the first weekly ad (my working code is below).

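To make things stranger, I can see the images loading in the inspector window... but I still cannot see them in the HTML. Any idea what is happening here, and how I can get the updated HTML (after the images have loaded)?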

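Here is the set of images I was able to extract from the HTML (by appending jpg). These are only the images for the pop-ups that appear when hovering over the canvas.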

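What I want are the images that actually make up the page/canvas. I can see them coming through (using the network traffic view in Firefox), but for some reason they do not appear in the HTML.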

Working code:

#import packages
from time import gmtime, strftime,sleep, time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium import webdriver
from selenium.webdriver.common.proxy import Proxy, ProxyType
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
#scraping packages
from bs4 import BeautifulSoup


USAPROXY = "177.84.23.122:3128"
def launch_webdriver(proxy):
    # Split "host:port" into its two parts
    proxy_host, _, proxy_port = proxy.rpartition(':')
    fp = webdriver.FirefoxProfile()
    # network.proxy.type: Direct = 0, Manual = 1, PAC = 2, AUTODETECT = 4, SYSTEM = 5
    fp.set_preference("network.proxy.type", 1)
    fp.set_preference("network.proxy.http", proxy_host)
    fp.set_preference("network.proxy.http_port", int(proxy_port))
    fp.set_preference("network.proxy.ssl", proxy_host)
    fp.set_preference("network.proxy.ssl_port", int(proxy_port))
    fp.set_preference("general.useragent.override", "whater_useragent")
    fp.update_preferences()
    return webdriver.Firefox(firefox_profile=fp)




def test():
    driver = launch_webdriver(USAPROXY)
    driver.set_page_load_timeout(11)
    # Load the flyer listing first, then the individual weekly ad
    driver.get("https://flipp.com/flyers?postal_code=97035")
    sleep(15)
    driver.get("https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad")
    sleep(5)  # give the page's scripts time to render
    my_html = driver.page_source
    soup = BeautifulSoup(my_html, 'lxml')
    tags = soup.find_all('img')  # finds only 3 imgs, there should be 100s
    for tag in tags:
        print(tag)
    print(soup.prettify())
#execute script
test()
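
One possible way to check the network-traffic observation from inside Selenium, rather than from the Firefox developer tools, is to ask the browser which resources it has fetched. The sketch below is only an illustration under stated assumptions: `loaded_image_urls`, `RESOURCE_JS`, and `IMAGE_EXTENSIONS` are hypothetical names that are not part of the code above, and it assumes the canvas images are recorded in the browser's Resource Timing buffer (which has a browser-defined size limit) and either are requested as images or have a common image file extension.

```python
# JavaScript executed in the page: return the URL and initiator type of every
# entry the browser has recorded in its Resource Timing buffer.
RESOURCE_JS = """
return window.performance.getEntriesByType('resource').map(function (e) {
    return {name: e.name, initiatorType: e.initiatorType};
});
"""

# Assumption: image URLs end with one of these extensions (query string ignored).
IMAGE_EXTENSIONS = ('.jpg', '.jpeg', '.png', '.gif', '.webp')


def loaded_image_urls(driver):
    """Return the de-duplicated URLs of image-like resources the browser fetched.

    `driver` is any object with an `execute_script` method, for example the
    Selenium webdriver returned by `launch_webdriver` above.
    """
    entries = driver.execute_script(RESOURCE_JS) or []
    urls = []
    for entry in entries:
        url = entry.get('name', '')
        path = url.split('?', 1)[0].lower()
        is_image = (entry.get('initiatorType') == 'img'
                    or path.endswith(IMAGE_EXTENSIONS))
        if is_image and url not in urls:
            urls.append(url)
    return urls
```

Calling `loaded_image_urls(driver)` after the final `sleep(5)` in `test()` would list whatever the browser had recorded by that point; whether that includes every image drawn on the canvas is not something this sketch can guarantee.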

1 Answer:

Answer 0 (score: 0)

I made one small change to your code, replacing soup = BeautifulSoup(my_html,'lxml') with soup = BeautifulSoup(my_html,'html.parser'), as follows:

  • Code:

    driver.set_page_load_timeout(11)
    driver.get("https://flipp.com/flyers?postal_code=97035")
    sleep(15)
    driver.get("https://flipp.com/weekly_ad/1550082-big-5-sporting-goods-weekly-ad")
    sleep(5)
    my_html = driver.page_source
    soup = BeautifulSoup(my_html,'html.parser')
    tags=soup.findAll('img')
    for tag in tags:print (tag)
    
  • Output:

    <img alt="" src="/94815ec0/images/page-favourites.svg"/>
    <img alt="" src="/94815ec0/images/page-flyers.svg"/>
    <img alt="" src="/94815ec0/images/page-coupons.svg"/>
    <img alt="" src="/94815ec0/images/profile.png"/>
    <img alt="" src="/94815ec0/images/signin-google-en.png"/>
    <img alt="" src="/94815ec0/images/signin-facebook-en.png"/>
    <img alt="" class="sl-icon" src="/94815ec0/images/sl/list-icon.svg"/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2143/1399408035/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2143/1399408035/large");'/>
    <img src="/94815ec0/images/location.svg"/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1568365/web_premium/1519664612.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1568365/web_premium/1519664612.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/1417562816/1417562816/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/1417562816/1417562816/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1570217/web_premium/1519767026.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1570217/web_premium/1519767026.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2217/1399408048/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2217/1399408048/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1548763/web_premium/1519408077.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1548763/web_premium/1519408077.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2392/1412008375/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2392/1412008375/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1558209/web_premium/1519940192.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1558209/web_premium/1519940192.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/2175/1399558010/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/2175/1399558010/large");'/>
    <img alt="" class="flyer-thumbnail" cover="true" fit="" href="https://f.wishabi.net/flyers/1553653/web_premium/1519086192.jpg" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://f.wishabi.net/flyers/1553653/web_premium/1519086192.jpg");'/>
    <img alt="" class="logo" contain="true" fit="" href="https://images.wishabi.net/merchants/1415661435/1415661435/large" is="flipp-lazy-image" src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVQYV2NgYAAAAAMAAWgmWQ0AAAAASUVORK5CYII=" style='background-image: url("https://images.wishabi.net/merchants/1415661435/1415661435/large");'/>
    <img alt="" src="/94815ec0/images/email_notices.png"/>
    <img alt="Flipp logo"/>
    <img alt="image of a sad ice cream" class="sad-cream"/>
    <img alt="Google Chrome Logo" class="browser-img chrome"/>
    <img alt="Mozilla Firefox Logo" class="browser-img ff"/>
    <img alt="Microsoft Edge Logo" class="browser-img edge"/>
    <img alt="Apple Safari Logo" class="browser-img safari"/>
    <img alt="" height="0" id="batBeacon0.08041384361820791" src="https://bat.bing.com/action/0?ti=5463843&amp;Ver=2&amp;mid=e698c347-3982-6279-c6a5-5e5b764b55dd&amp;evt=pageLoad&amp;sid=ab7428c0-1&amp;lt=1647&amp;pi=0&amp;lg=en-US&amp;sw=1366&amp;sh=768&amp;sc=24&amp;tl=Big%205%20Sporting%20Goods%20Weekly%20Ad%20for%20Lake%20Oswego%20this%20week%20(Feb%2025,%202018%20-%20Mar%203,%202018)%20-%20Flipp&amp;kw=flyers,%20coupons,%20shopping%20list,%20deals,%20circulaires,%20coupons,%20liste%20d%E2%80%99achats,%20offres&amp;p=https%3A%2F%2Fflipp.com%2Fweekly_ad%2F1550082-big-5-sporting-goods-weekly-ad&amp;r=&amp;msclkid=N&amp;rn=558478" style="width:0px; height:0px; display:none; visibility:hidden;" width="0"/>
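
In the output above, the lazy-loaded images (those with is="flipp-lazy-image") carry their real URL in the href attribute and in the background-image style, while src holds only a 1x1 placeholder. As a minimal sketch of how those URLs might be collected, the following uses only the standard-library HTML parser so that it runs without extra packages; `LazyImageCollector` and `lazy_image_urls` are hypothetical names introduced for illustration. With Beautiful Soup, something like soup.find_all('img', attrs={'is': 'flipp-lazy-image'}) should select the same tags.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class LazyImageCollector(HTMLParser):
    """Collect the href of every <img is="flipp-lazy-image"> tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <img ... />
        if tag != 'img':
            return
        attributes = dict(attrs)
        if attributes.get('is') == 'flipp-lazy-image' and attributes.get('href'):
            self.urls.append(attributes['href'])


def lazy_image_urls(html):
    """Return the real image URLs of the lazy-loaded <img> tags in `html`."""
    collector = LazyImageCollector()
    collector.feed(html)
    return collector.urls
```

Passing driver.page_source to `lazy_image_urls` would return the logo and flyer-thumbnail URLs seen in the output. These are not the images that make up the flyer canvas itself, which do not appear as img tags in this output.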