Hi, I'm trying to download images from BGS borehole scans that span more than one page, for example http://scans.bgs.ac.uk/sobi_scans/boreholes/795279/images/10306199.html and http://scans.bgs.ac.uk/sobi_scans/boreholes/18913699/images/18910430.html
I managed to download the first 2 pages of the first example, but I get the error below when I reach the last page. On that page the NextPage variable should be None, because the tag is not on the web page. At that point I want to continue to the next location, which I haven't added yet, but I have an Excel list of the URLs. The code is based on https://automatetheboringstuff.com/2e/chapter12/
Traceback (most recent call last):
  File "C:/Users/brentond/Documents/Python/Pdf BGS Scans.py", line 73, in <module>
    NextPage = soup.select('a[title="Next page"]')[0]
IndexError: list index out of range
import pyautogui
import pyperclip
import webbrowser
import PyPDF2
import os
import openpyxl
import pdfkit
import requests
import bs4
# Define path of excel file
from requests import Response
path = r'C:\Users\brentond\Documents\TA2'
# Change directory to target location
os.chdir(path)
# Create workbook object
wb = openpyxl.load_workbook('BGS Boreholes.xlsm')
# Create worksheet object
ws = wb.get_sheet_by_name('Open')
# Assign URL to variable
StartURL = ws['A2'].value
URL = StartURL
NextURL = "NextURL"
# Assign BH ID to variable
Location = ws['B2'].value
while NextURL is not None:
    # Download URL
    res = requests.get(URL)  # type: Response
    res.raise_for_status()
    # Create beautiful soup object
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # Find the URL of the borehole scan image.
    Scan = soup.select('#image_content img')
    # Check on HTML elements
    Address = soup.select('#image')
    AddressText = Address[0].get('src')
    print(AddressText)
    print()
    if Scan == []:
        print('Could not find scan image.')
    else:
        ScanUrl = Scan[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (ScanUrl))
        res = requests.get(ScanUrl)
        res.raise_for_status()
        # Save the image to path
        PageNo = 0
        imageFile = open(os.path.join(path, Location) + "-Page" + str(PageNo) + ".png", 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()
    # Find URL for next page
    PageNo = PageNo + 1
    NextPage = soup.select('a[title="Next page"]')[0]
    if NextPage == []:
        continue
    else:
        print(NextPage)
        NextURL = NextPage.get('href')
    URL = NextURL
    print(NextURL)
print('Done.')
Answer 0 (score: 1)
If the element doesn't exist, you can't select its first item. You can first verify that the element exists using find / find_all, or you can use try/except to handle the IndexError and change your script's behaviour in the error case.
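Both approaches can be sketched against a static snippet. This is a minimal illustration only; the HTML string below is hypothetical and simply mimics a last page that has no "Next page" anchor:

```python
import bs4

# Hypothetical last-page HTML: no <a title="Next page"> anchor present.
html = '<div id="image_content"><img src="scan.png"></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# Option 1: check with find() first -- it returns None instead of raising.
anchor = soup.find('a', title='Next page')
print(anchor)  # None on the last page

# Option 2: catch the IndexError raised by select(...)[0].
try:
    next_url = soup.select('a[title="Next page"]')[0].get('href')
except IndexError:
    next_url = None
print(next_url)  # None on the last page
```

Either way the loop can test for None and move on to the next location instead of crashing.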
Answer 1 (score: 1)
Since the anchor does not exist on the last page, soup.select('a[title="Next page"]') returns an empty list, so index zero does not exist and the IndexError is raised. The easiest change is from
NextPage = soup.select('a[title="Next page"]')[0]
if NextPage == []:
    continue
else:
    print(NextPage)
    NextURL = NextPage.get('href')
to
NextPage = soup.select('a[title="Next page"]')
if not NextPage:
    continue
else:
    NextPage = NextPage[0]
    print(NextPage)
    NextURL = NextPage.get('href')
or
NextPage = soup.select('a[title="Next page"]')
if not NextPage:
    continue
else:
    print(NextPage[0])
    NextURL = NextPage[0].get('href')
It comes down to personal preference.
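The corrected termination logic can also be factored into a small generator that walks the pages until no "Next page" anchor remains. This is a sketch, not the asker's final script: the `fetch` callable is an assumption (in the real script it would wrap `requests.get(url).text`), and the two-page site below is a fake stand-in for the BGS pages so the walk itself can be demonstrated offline:

```python
import bs4

def scan_page_urls(start_url, fetch):
    """Yield each page URL, following 'Next page' anchors until none remains.
    `fetch` is any callable mapping a URL to its HTML text (e.g. a wrapper
    around requests.get), injected so the traversal is testable offline."""
    url = start_url
    while url is not None:
        yield url
        soup = bs4.BeautifulSoup(fetch(url), 'html.parser')
        anchors = soup.select('a[title="Next page"]')
        # Empty list means last page: stop instead of indexing [0] and crashing.
        url = anchors[0].get('href') if anchors else None

# Fake two-page "site" for illustration only.
pages = {
    'page1.html': '<a title="Next page" href="page2.html">next</a>',
    'page2.html': '<p>last page</p>',
}
visited = list(scan_page_urls('page1.html', pages.__getitem__))
print(visited)  # ['page1.html', 'page2.html']
```

With this shape, the outer script can loop over the Excel list of locations and call the generator once per start URL.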