Web scraping in Python with Selenium: finding an element in a modal box and downloading it in a loop

Date: 2018-08-22 10:02:09

Tags: python selenium pdf web-scraping downloading

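I want to download, in a Python loop, the file named "sprawozdanie merytoryczne" for 2017 for every organization. To download one file manually, you go to http://sprawozdaniaopp.mpips.gov.pl/, click the "Znajdź" button, and then click an organization's name; a modal box appears with a "sprawozdanie merytoryczne" link for that organization. I want to automate this for all organizations, but I have run into a problem. On the first pass through the loop everything works and the first file is downloaded. On the second pass the modal window opens, but the script does not see the "sprawozdanie merytoryczne" link even though it is there. I think the problem is with switching between windows. I would be very grateful for any help. Here is my code: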

import urllib
import urllib.request
import requests
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
import re
import unicodecsv  # import whole module
import requests  # import whole module
from bs4 import BeautifulSoup  # import only things that we need
import time
import smtplib
from selenium import webdriver
chrome_path = r"C:\Users\username\AppData\Local\Programs\Python\Python35-32\Scripts\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("http://sprawozdaniaopp.mpips.gov.pl/")

rok = driver.find_element_by_xpath("//*[@id='instanceYear']")
rok.send_keys('2017') 

wojewodztwo = driver.find_element_by_xpath("//*[@id='Province']")
wojewodztwo.clear()
wojewodztwo.send_keys('MAZOWIECKIE')  
elem = driver.find_element_by_xpath("//*[@id='btnsearch']/span")
elem.click()
for i in range(1, 1348):
    winhandle = driver.current_window_handle
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(3) > a'
    p3 = p1 + str(i) + p2
    elem1 = driver.find_element_by_css_selector(p3)
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(5)'
    p3 = p1 + str(i) + p2
    miejscowosc = driver.find_element_by_css_selector(p3)
    print(miejscowosc.text) #miejscowosc means city
    miejscowosc1=miejscowosc.text
    p1 = r'#form1 > div > div.grid > table > tbody > tr:nth-child('
    p2 = ') > td:nth-child(4)'
    p3 = p1 + str(i) + p2
    wojewodztwo = driver.find_element_by_css_selector(p3)
    elem1.click()

    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, ".ui-dialog.ui-widget.ui-widget-content.ui-corner-all")))


    try:
        elem2 = driver.find_element_by_link_text("Sprawozdanie merytoryczne").click()
        organizationName = driver.find_elements_by_class_name("td1")
        orgname = str(organizationName[11].text)

        orgname1 = orgname.replace('"', "")
        print(organizationName[11].text)

        driver.switch_to.window(driver.window_handles[1])
        urltemp = driver.current_url
        urltodownload=  requests.get(urltemp)

        path1 = r'C:/Users/adunajsk/Desktop/pdf17maz/'
        path2 = '.pdf'
        path3 = path1 + orgname1 + path2
        print(path3)
        with open(path3, 'wb') as f:
                f.write(urltodownload.content)
        driver.close()

        del organizationName[:] 
    except NoSuchElementException:
        print("Plik nie istnieje")

    driver.switch_to.window(winhandle)

    WebDriverWait(driver, 8).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "body > div.ui-dialog.ui-widget.ui-widget-content.ui-corner-all > div.ui-dialog-titlebar.ui-widget-header.ui-corner-all.ui-helper-clearfix > a > span")))

    closebutton = driver.find_element_by_css_selector("body > div.ui-dialog.ui-widget.ui-widget-content.ui-corner-all > div.ui-dialog-titlebar.ui-widget-header.ui-corner-all.ui-helper-clearfix > a")
    closebutton.click()

1 Answer:

Answer 0 (score: 0)

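The problem is that once a modal dialog has been opened, it stays in the DOM even after it is closed. When you open the second dialog, the locator still matches the first, already closed one and tries to click there. You can also configure the driver to download the PDF directly instead of opening it.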
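
One way to work around the leftover dialogs is to collect every dialog element and keep only the one that is currently displayed. The following is a minimal sketch, not a tested solution for this site: the helper name `pick_visible_dialog` and the `div.ui-dialog` selector in the usage comment are assumptions, and the only Selenium call it relies on is `WebElement.is_displayed()`.

```python
def pick_visible_dialog(dialogs):
    """Return the last element in `dialogs` that is displayed, or None.

    `dialogs` is any sequence of objects exposing is_displayed(),
    for example the list returned by a Selenium find_elements call.
    """
    visible = [d for d in dialogs if d.is_displayed()]
    # The most recently opened dialog is assumed to be appended last in the DOM.
    return visible[-1] if visible else None


# Hypothetical usage, assuming `driver` is an open WebDriver session:
# dialogs = driver.find_elements_by_css_selector("div.ui-dialog")
# modal_box = pick_visible_dialog(dialogs)
# if modal_box is not None:
#     modal_box.find_element_by_link_text("Sprawozdanie merytoryczne").click()
```

Searching inside the returned element, rather than from `driver`, keeps the link lookup from matching a link in an older, hidden dialog.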

Here is the code:

Note: I coded and tested this in Java, so the Python code may contain syntax errors.

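    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Set Chrome options to download the PDF instead of opening it in the browser;
    # this removes the need to handle windows and makes the script much faster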
    options = webdriver.ChromeOptions()
    downloadPath = r'C:\Users\username\Downloads'
    profile = {"plugins.plugins_list": [{"enabled":False,"name":"Chrome PDF Viewer"}],"download.default_directory" : downloadPath}
    options.add_experimental_option("prefs",profile)
    driver = webdriver.Chrome(r"C:\Users\username\AppData\Local\Programs\Python\Python35-32\Scripts\chromedriver.exe", chrome_options=options)

    driver.get("http://sprawozdaniaopp.mpips.gov.pl/")
    # The locator must be passed as a single (By, value) tuple
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.ID, 'Province'))).send_keys('MAZOWIECKIE')
    driver.find_element_by_id('instanceYear').send_keys('2017')
    driver.find_element_by_id('btnsearch').click()

    # After searching, wait until the table has loaded a cell containing the text MAZOWIECKIE
    WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, '//table[@class="webgrid"]/tbody//td[normalize-space(.)="MAZOWIECKIE"]')))

    # Get all rows and iterate through them, so the code does not depend on the number of rows
    rows = driver.find_elements_by_css_selector('table.webgrid tbody tr')
    for row in rows:
        # Get the KRS number from the second column (.text is a property, not a method)
        krs = row.find_element_by_css_selector('td:nth-child(2)').text
        # Click the link in the Nazwa column
        row.find_element_by_css_selector('td:nth-child(3) a').click()
        # Find the modal box DIV that contains the KRS number of the clicked row.
        # As an alternative, you can get all modal boxes and pick the visible one.
        modalBoxLocator = "(//table[@id='tbldetails']//td[contains(.,'" + krs + "')]/ancestor::div[contains(@class,'ui-dialog')][2])[last()]"  
        modalBox = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.XPATH, modalBoxLocator)))
        # Find the TD with the text 2017, then click the first "Sprawozdanie merytoryczne" link after it
        modalBox.find_element_by_xpath('.//tr[./td[.="2017"]]/following-sibling::tr[.//a[.="Sprawozdanie merytoryczne"]][1]//a').click()
        # Close the modal box
        modalBox.find_element_by_css_selector('a.ui-dialog-titlebar-close').click()

        #if len(modalBox.find_elements_by_css_selector('a.ui-dialog-titlebar-close')) > 0:
        #   modalBox.find_element_by_css_selector('a.ui-dialog-titlebar-close').click()
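
When Chrome saves the PDFs itself, the script no longer knows when a file has finished downloading. The sketch below is one possible way to wait for that. It relies on Chrome writing in-progress downloads with a `.crdownload` suffix, and the helper name `wait_for_downloads` and its `timeout` and `poll` parameters are made up for illustration.

```python
import os
import time


def wait_for_downloads(download_dir, timeout=60.0, poll=0.5):
    """Wait until download_dir contains no '.crdownload' files.

    Returns True once no partial downloads remain, or False if
    `timeout` seconds pass while some are still in progress.
    """
    deadline = time.monotonic() + timeout
    while True:
        pending = [name for name in os.listdir(download_dir)
                   if name.endswith(".crdownload")]
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)


# Hypothetical usage before closing the browser:
# if not wait_for_downloads(r'C:\Users\username\Downloads'):
#     print("Some downloads did not finish in time")
```

The helper returns True immediately if no download has started yet, so it is best called after the last click, with a short pause first if the site is slow to respond.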