Question

I am using Selenium with Python and a headless Chrome driver to scrape all the images on this site: [link missing]. A summary of the code I use: [code block missing]

Here is a valid link:

https://ih0.redbubble.net/image.420357355.0428/ra%2Clongsleeve%2Cx925%2C101010%3A01c5ca27c6%2Cfront-c%2C210%2C180%2C210%2C230-bg%2Cf8f8f8.lite-1u1.jpg

And an invalid one:

data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAaQAAAHMAQMAAACgJU5BAAAAA1BMVEUAAACnej3aAAAAAXRSTlMAQObYZgAAAC9JREFUeNrtwTEBAAAAwiD7p7bETmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACRA2EIAAF8YGbpAAAAAElFTkSuQmCC

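Sometimes nearly 50% of the links I get are invalid, and sometimes that number is almost zero, even though I use the same page URL and the same code. Can anyone explain this to me? Thank you very much.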

Answer 1

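The "invalid" IMG is not really invalid; it is just a placeholder image. It is a base64 image. You can google it for more detail, but it is basically an actual image (a PNG here) encoded as text. In this case the string is too short to be anything of value. From what I can see on the site, it automatically loads the IMGs for the first 16 T-shirts and the rest are placeholders. As you scroll down (in my experience), the rest get loaded: the placeholders are replaced by the actual image URLs.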
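To see that the placeholder really is a tiny stand-in image, its header can be decoded directly. The following is a minimal sketch, assuming the `data:` URI quoted in the question; the function name `png_size_from_data_uri` is made up for illustration. It decodes only the first 44 base64 characters (33 bytes), which is enough to cover the PNG signature and the IHDR chunk that stores the width and height.

```python
import base64
import struct

# The placeholder src quoted in the question.
PLACEHOLDER = (
    "data:image/png;base64,"
    "iVBORw0KGgoAAAANSUhEUgAAAaQAAAHMAQMAAACgJU5BAAAAA1BMVEUAAACnej3aAAAAAXRSTlMAQObYZgAAAC9JREFUeNrtwTEBAAAAwiD7p7bETmAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACRA2EIAAF8YGbpAAAAAElFTkSuQmCC"
)

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size_from_data_uri(uri):
    """Return (width, height) of a base64-encoded PNG data URI."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:image/png") or "base64" not in header:
        raise ValueError("not a base64 PNG data URI")
    # 44 base64 characters decode to 33 bytes:
    # 8-byte signature + 4-byte length + b"IHDR" + 13 bytes of IHDR data + CRC start.
    raw = base64.b64decode(payload[:44])
    if raw[:8] != PNG_SIGNATURE or raw[12:16] != b"IHDR":
        raise ValueError("payload does not start with a PNG header")
    # Width and height are two big-endian 32-bit integers at offsets 16..23.
    return struct.unpack(">II", raw[16:24])


print(png_size_from_data_uri(PLACEHOLDER))  # (420, 460)
```

The whole image is described by a string of a couple of hundred characters, which fits the explanation that it is a blank placeholder rather than a product photo.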

You should be able to hit the page, scroll down, and then have all the URLs loaded, if that is what you are shooting for.
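The scrolling step could look roughly like the sketch below. It assumes a Selenium WebDriver object passed in as `driver`; the function name, step size, pause, and step limit are illustrative choices, not values taken from the site. It scrolls in increments so that lazy-loaded images get a chance to swap in, and stops once the viewport reaches the bottom of the page.

```python
import time

# JavaScript that reports whether the viewport has reached the page bottom.
AT_BOTTOM_JS = (
    "return window.innerHeight + window.pageYOffset"
    " >= document.body.scrollHeight;"
)


def scroll_to_bottom(driver, step=1000, pause=0.5, max_steps=50):
    """Scroll down in increments; return the number of scroll steps taken.

    `driver` only needs an execute_script(script, *args) method, which a
    Selenium WebDriver provides.
    """
    for steps_done in range(1, max_steps + 1):
        driver.execute_script("window.scrollBy(0, arguments[0]);", step)
        time.sleep(pause)  # give lazy-loaded images time to replace placeholders
        if driver.execute_script(AT_BOTTOM_JS):
            return steps_done
    return max_steps  # gave up before reaching the bottom
```

The `max_steps` cap is there so a page that keeps growing (infinite scroll) cannot keep the loop running forever.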

You can filter these out by changing your locator to a CSS selector, e.g.

img.shared-components-ShopSearchResultsGridImage-ShopSearchResultsGridImage__primary--3pEtg[src^='http']

This will only find IMG tags that have the desired class and whose src value starts with http (which covers both http and https).

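You could take this a step further and compare the number of IMG tags whose src starts with http against the number whose src does not, to see how many IMG tags have not loaded yet.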
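A rough sketch of that comparison is shown here. The helper names `split_by_loaded` and `collect_srcs` are hypothetical, and `collect_srcs` assumes Selenium 4's `find_elements(By.CSS_SELECTOR, ...)` call; the sample src values at the bottom are invented for the demonstration.

```python
def split_by_loaded(srcs):
    """Split src values into (loaded, placeholders) by the http prefix."""
    loaded, placeholders = [], []
    for src in srcs:
        if src and src.startswith("http"):  # matches http and https
            loaded.append(src)
        else:
            placeholders.append(src)  # data: URIs, empty or missing src
    return loaded, placeholders


def collect_srcs(driver, css_class):
    """Read the src of every IMG with the given class from a live driver."""
    # Imported here so split_by_loaded can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    images = driver.find_elements(By.CSS_SELECTOR, "img." + css_class)
    return [img.get_attribute("src") for img in images]


# Invented sample values standing in for collect_srcs(driver, ...) output.
sample = [
    "https://example.com/shirt-1.jpg",
    "data:image/png;base64,iVBORw0KGgo=",
    None,
]
loaded, placeholders = split_by_loaded(sample)
print(len(loaded), "loaded,", len(placeholders), "still placeholders")
```

If the placeholder count is above zero, the page has not finished swapping in real image URLs, so it makes sense to scroll or wait and then count again.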

Answer 2

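I solved this by setting the page load strategy to "none" and repeatedly fetching all the links until every one of them is valid, then stopping the page load to save time. Hope this is useful to someone.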

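import time

from selenium import webdriver
from selenium.webdriver.common.by import By

IMG_CLASS = "shared-components-ShopSearchResultsGridImage-ShopSearchResultsGridImage__primary--3pEtg"


def extract(src):
    # Assumed helper (not shown in the original answer): anything that is not
    # an http(s) URL, such as the base64 placeholder, counts as 'invalid'.
    return src if src and src.startswith("http") else "invalid"


def quick_load(url):
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    # Return control right away instead of waiting for the full page load.
    # Selenium 4 spelling; the original answer set "pageLoadStrategy" through
    # DesiredCapabilities, which older Selenium versions used.
    options.page_load_strategy = "none"
    driver = webdriver.Chrome(options=options)

    while url:  # <url is valid>
        done_page = False
        # Open the page in a new window and switch to it.
        driver.execute_script("window.open(arguments[0]);", url)
        driver.switch_to.window(driver.window_handles[-1])
        while not done_page:
            driver.execute_script("window.scroll(0, 5000)")
            time.sleep(1)
            new_links = driver.find_elements(By.CLASS_NAME, IMG_CLASS)
            im_links = [extract(l.get_attribute("src")) for l in new_links]
            im_links = [l for l in im_links if l != "invalid"]
            if len(im_links) > 100:
                done_page = True
                # Enough valid links: stop loading the rest of the page.
                driver.execute_script("window.stop();")
                # <do something> with im_links
                url = None  # <update url>: next page URL, or None to finish

        driver.execute_script("window.close();")
        driver.switch_to.window(driver.window_handles[0])
    driver.quit()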

Unstable results with Selenium
