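I am trying to write a script that downloads specific PDF files from the BlackRock websites (ishares.com or blackrock.com), but the click() call usually does not work. It does work occasionally, though: roughly once every 3-5 runs it downloads a file.
(When I use a similar script for all the PDFs on those sites, it also works only once in several runs, and whenever it works it always downloads the same file and skips the rest.)
So, suppose I try to download the KIID / KID PDF files from these pages: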
https://www.ishares.com/uk/individual/en/products/251857/ishares-msci-emerging-markets-ucits-etf-inc-fund?switchLocale=y&siteEntryPassthrough=true
https://www.ishares.com/ch/individual/en/products/251931/ishares-stoxx-europe-600-ucits-etf-de-fund?switchLocale=y&siteEntryPassthrough=true
https://www.blackrock.com/uk/individual/products/251565/ishares-euro-corporate-bond-large-cap-ucits-etf?switchLocale=y&siteEntryPassthrough=true
using the following code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyvirtualdisplay import Display
import time

def blackrock_getter(url):
    with Display():
        mime_types = "application/pdf,application/vnd.adobe.xfdf,application/vnd.fdf,application/x-pdf,application/vnd.adobe.xdp+xml"
        profile = webdriver.FirefoxProfile()
        profile.set_preference('browser.download.folderList', 2)
        profile.set_preference('browser.download.manager.showWhenStarting', False)
        profile.set_preference('browser.download.dir', '/home/user/kiid_temp')
        profile.set_preference('browser.helperApps.neverAsk.saveToDisk', mime_types)
        profile.set_preference("plugin.disable_full_page_plugin_for_types", mime_types)
        profile.set_preference('pdfjs.disabled', True)
        driver = webdriver.Firefox(firefox_profile=profile)
        driver.get(url)
        try:
            element = WebDriverWait(driver, 20).until(
                EC.element_to_be_clickable((By.XPATH, ("//header[@class='main-header']//a[@class='icon-pdf'][1]"))))
            driver.execute_script("arguments[0].click();", element)
        finally:
            driver.quit()
        time.sleep(3)  # very precise mechanism to wait until the download is complete

def main():
    urls_file = open('urls_list.txt', 'r')  # the URLs I pasted above
    for url in urls_file.readlines():
        if url[-1:] == "\n":
            url = url[:-1]
        if url[0:4] == "http":
            filename = url.split('?')[0]
            filename = filename.split('/')[-1]
            if 'blackrock.com/' in url or 'ishares.com/' in url:
                print(f"Processing {filename}...")
                blackrock_getter(url)

main()
The result is (every once in a while) a single file: kiid-ishares-msci-emerging-markets-ucits-etf-dist-gb-ie00b0m63177-zh.pdf.
Is there any way to fix this?
Answer 0: (score: 0)
You can try the pyautogui module, but you will not be able to use the computer while the program is running.
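Answer 1: (score: 0)
It looks like the script finishes before the file download does; that is, the download is not completed within the 3 seconds. Here is a method that waits until the PDF download has finished.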
import time

# method to get the downloaded file name
# (relies on an existing `driver`; chrome://downloads only exists in Chrome)
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    # switch to new tab
    driver.switch_to.window(driver.window_handles[-1])
    # navigate to chrome downloads
    driver.get('chrome://downloads')
    # define the endTime
    endTime = time.time()+waitTime
    while True:
        try:
            # get downloaded percentage
            downloadPercentage = driver.execute_script(
                "return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value")
            # check if downloadPercentage is 100 (otherwise the script will keep waiting)
            if downloadPercentage == 100:
                # return the file name once the download is completed
                return driver.execute_script("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text")
        except Exception:
            # the downloads item may not be rendered yet; retry on the next pass
            pass
        time.sleep(1)
        # give up (and implicitly return None) once waitTime has elapsed
        if time.time() > endTime:
            break
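
The method above depends on the chrome://downloads page, while the question's script drives Firefox. A browser-independent alternative is to poll the download directory until a finished PDF shows up. The following is only a minimal sketch of that idea, not a tested fix for the BlackRock pages: the function name wait_for_download and its parameters are hypothetical, and it assumes the browser keeps a temporary ".part" (Firefox) or ".crdownload" (Chrome) file in the directory while a download is in progress.

```python
import time
from pathlib import Path

# Temporary-file suffixes browsers use while a download is in progress
# (assumption: Firefox writes ".part", Chrome writes ".crdownload").
IN_PROGRESS_SUFFIXES = (".part", ".crdownload")


def wait_for_download(directory, timeout=60.0, poll_interval=0.5,
                      suffix=".pdf", known=frozenset()):
    """Poll `directory` until a finished file ending in `suffix` appears.

    `known` holds file names that were already present before the click,
    so that old downloads are not mistaken for the new one.
    Returns the Path of the newest new file, or None if `timeout` expires.
    """
    directory = Path(directory)
    deadline = time.monotonic() + timeout
    while True:
        names = [p.name for p in directory.iterdir()]
        in_progress = any(n.endswith(IN_PROGRESS_SUFFIXES) for n in names)

        candidates = []  # (modification time, path) of new, non-empty files
        for name in names:
            if name.endswith(suffix) and name not in known:
                path = directory / name
                try:
                    info = path.stat()
                except FileNotFoundError:
                    continue  # file was renamed or removed between listing and stat
                if info.st_size > 0:
                    candidates.append((info.st_mtime, path))

        if candidates and not in_progress:
            return max(candidates)[1]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll_interval)
```

In the question's script this would be called after the click and before driver.quit(), for example wait_for_download('/home/user/kiid_temp', known=names_before_click), where names_before_click is a set of the file names listed in that directory before clicking. As posted, the question's code quits the browser first and only then sleeps, so the 3-second pause cannot help the download, which may explain why a file appears only occasionally.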