Web scraping: image URL comes back as ''

Time: 2020-09-09 08:16:38

Tags: python beautifulsoup

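I think my problem is JavaScript running on the page, which doesn't load the images until I scroll down. Can anyone help? The script runs fine until it reaches "Zendikar Rising (ZNR)", a page with a lot of images. Then I get "Failed to save imageMakindi Ox (ZNR).png from url ...". It should print a URL there, but the value is ''. I added some debugging code to skip cards with a missing URL, but then I lose a lot of cards.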

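I tried removing the empty entries, but if you run the script you will see that I collect the same number of card names and URLs (some of the URLs being blank). Removing the empty URLs throws the counts out of step, so names no longer line up with their images and I lose cards from the set.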

Here is the code in question:

import requests
import os
from os.path import basename
from bs4 import BeautifulSoup
 
path = os.getcwd()
print ("The current working directory is %s" % path)
 
url = 'https://scryfall.com/sets'
r=requests.get(url).text
soup = BeautifulSoup(r, 'html.parser')
 
####################GATHERS ALL URLS FROM SET DIRECTORY#####################
links = []
Urls = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
 
for link in links:
    if link is not None:
        if 'https://scryfall.com/sets/' in link:
            if link not in Urls:
                Urls.append(link)
 
#################START OF ALL URL LOOPS################################
for Url in Urls: ##goes through all the URLs gathered from the sets links
    r=requests.get(Url).text
    soup = BeautifulSoup(r, 'html.parser')
 
    temp = soup.find('h1', {'class': 'set-header-title-h1'}).contents
    temp = ''.join(temp)
    temp = temp.strip()
    temp = temp.replace(':', '')
    temp = temp.replace(' ', '')
 
    test2 = (f"{path}\\{temp}")
#############################################MAKE DIRECTORY FOR SET FOLDERS##################
    try:
        os.mkdir(test2)
    except OSError:
        print ("Creation of the directory %s failed" % test2)
    else:
        print ("Successfully created the directory %s " % test2)
 
############################################GATHER ALL IMAGES####################
    images = soup.find_all('img')
 
    pictures = [] ##stores all the picture URLS
    names = [] ##stores all the name
 
    for image in images[:-1]:
        names.append(image.get('alt'))
        pictures.append(image.get('src'))
####################SAVES ALL IMAGES AS FILES#################
 
    x=0
    for i in pictures:
        fn = names[x] + '.png'
        try:
            with open(f'{test2}\\'+basename(fn),"wb") as f:
 
                f.write(requests.get(i).content)
                ##print(i)
                ##print(f'saved {fn} to {path}')
                x+=1
        except OSError:
            print(f"Failed to save image{fn} from url{i}")
            print(len(pictures))
            print(len(names))
            exit()
##################RESETS IMAGES AND NAMES FOR NEXT SET FOLDER#############
 
    pictures.clear()
    names.clear()
print("Completed With No Errors")

1 answer:

Answer 0 (score: 1)

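Indeed, the images are lazy-loaded by a JS script, which is why the <img> tags further down the page have no usable src value.

However, the solution is quite simple. If you look at several of the <img> tags that have not loaded yet, you will see that the image link is not in the src attribute but in the data-src attribute.

For example: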

<img alt="Wayward Guide-Beast (ZNR)" class="card znr border-black" data-component="lazy-image" data-src="https://c1.scryfall.com/file/scryfall-cards/normal/front/e/b/ebfe94fc-7a98-4f53-8fd0-f5fd016b1873.jpg?1599472001" src="" title="Wayward Guide-Beast (ZNR)"/>

So all you need to do is check whether src is empty and, if it is, scrape the data-src attribute instead.
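
The check described above can be sketched as a small helper. This is only an illustration: the name `pick_image_url` is made up here, and plain dicts stand in for BeautifulSoup tags so the snippet runs on its own (a BeautifulSoup `Tag` exposes the same `.get()` lookup for attributes).

```python
def pick_image_url(tag):
    """Return the image URL for an <img> tag, or None if it has none.

    `tag` can be anything with a dict-style .get(), e.g. a BeautifulSoup Tag.
    pick_image_url is a hypothetical helper name, not part of any library.
    """
    src = tag.get('src')
    if src:
        # Non-empty src: the image URL is already in the static HTML.
        return src
    # Empty or missing src: fall back to the lazy-loading attribute.
    return tag.get('data-src') or None


# Stand-in for the lazy-loaded <img> tag shown in the answer.
lazy_img = {
    'alt': 'Wayward Guide-Beast (ZNR)',
    'src': '',
    'data-src': 'https://c1.scryfall.com/file/scryfall-cards/normal/front/e/b/ebfe94fc-7a98-4f53-8fd0-f5fd016b1873.jpg?1599472001',
}
# Stand-in for an <img> tag whose src is already filled in (made-up URL).
eager_img = {'alt': 'Example', 'src': 'https://example.com/card.jpg'}

lazy_url = pick_image_url(lazy_img)
eager_url = pick_image_url(eager_img)
missing_url = pick_image_url({'alt': 'No image'})
```

In the question's loop, this would amount to replacing `pictures.append(image.get('src'))` with `pictures.append(pick_image_url(image))`, which keeps `names` and `pictures` the same length, so each name still lines up with its image.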