Getting files from a web page in Python 3 using the Selenium and Requests modules

Date: 2017-01-07 16:24:18

Tags: python selenium beautifulsoup python-requests

I'm hoping to get some help with a problem I've run into. I'm new to Python and have been working through Al Sweigart's "Automate the Boring Stuff with Python" to automate some very tedious work.

Here's an overview of the problem: I'm trying to reach a web page and use the Requests and BeautifulSoup modules to parse the site, grab the URLs of the files I need, and download those files. The process works well except for one snag... the page has a ReportDropDown option that filters the displayed results. The problem is that even though the page results update with new information, the page URL doesn't change, so my requests.get() only ever pulls the information from the default filter.
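For context: since this is an .aspx page, the dropdown most likely triggers an ASP.NET postback, where the new selection is sent as a POST along with the form's hidden fields (typically __VIEWSTATE and __EVENTVALIDATION) rather than via the URL. Below is a stdlib-only sketch, not verified against this particular site, of collecting those hidden fields so they could be posted back with a new dropdown value; the field name 'ReportDropDown' and value '5' come from the question's own notes and would need confirming in the page source:

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of <input type="hidden"> fields."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        a = dict(attrs)
        if a.get('type') == 'hidden' and 'name' in a:
            self.fields[a['name']] = a.get('value') or ''

def extract_hidden_fields(html):
    """Return a dict of all hidden form fields found in an HTML page."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields

# Hypothetical usage for a requests-only postback (names unverified):
# payload = extract_hidden_fields(res.text)
# payload['ReportDropDown'] = '5'
# res = requests.post(standardURL, data=payload)
```

Whether this works depends on exactly which fields the site's JavaScript submits, which is why the Selenium route below is often the more reliable one.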

To work around that, I tried using Selenium to change the report selection... that also works well, except that I can't make the Requests module pull from the Selenium browser instance I opened.

So it seems I can use Requests and BeautifulSoup to get the information for the page's 'default' dropdown filter, and I can use Selenium to change the ReportDropDown option, but I can't combine the two.

Part 1:

#! python3
import os, requests, bs4
os.chdir('C:\\Standards')
standardURL = 'http://www.nerc.net/standardsreports/standardssummary.aspx'
res = requests.get(standardURL)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

# this is the url pattern when inspecting the elements on the page
linkElems = soup.select('.style97 a')

# I wanted to save the hyperlinks into a list
splitStandards = []
for link in range(len(linkElems)):
    splitStandards.append(linkElems[link].get('href'))

# Next, I wanted to create the pdf's and copy them locally
print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
for item in splitStandards:
    j = os.path.basename(item)      # BAL-001-2.pdf, etc...
    f = open(j, 'wb')
    ires = requests.get(item)
    # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf
    ires.raise_for_status()
    for chunk in ires.iter_content(1000000):    # 1MB chunks
        f.write(chunk)
    print('Completing download for: ' + str(j) + '.')
    f.close()
print()
print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

This pattern works well, except that I can't change the ReportDropDown option and then use Requests to pull the new page information. I've tinkered with requests.get(), requests.post(url, data={}), selenium-requests, and so on...

Part 2:

Using Selenium seemed straightforward enough, but I can't make requests.get() pull from the correct browser instance. I also had to create a Firefox profile (seleniumDefault) with a couple of about:config changes... (Windows+R, firefox.exe -p). Update: the about:config change was to temporarily set browser.tabs.remote.autostart = True.

from selenium import webdriver

# I used 'fp' to use a specific firefox profile
fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
browser = webdriver.Firefox(fp)
browser.get('http://www.nerc.net/standardsreports/standardssummary.aspx')

# There are 5 possible ReportDropDown selections but I only wanted 3 of them (current, future, inactive).
# In the html code, after a selection is made, it reads as: option selected="selected" value="5" -- where 'value' is the selection number

currentElem = browser.find_elements_by_tag_name('option')[0]
futureElem = browser.find_elements_by_tag_name('option')[1]
inactiveElem = browser.find_elements_by_tag_name('option')[4]

# Using the above code line for "browser.get()" and then currentElem.click(), or futureElem.click(), or inactiveElem.click() correctly changes the page selection. Apparently the browser.get() is needed to refresh the page data before making a new option selection.
# Note: changing the ReportDropDown option doesn't alter the page URL path

So, my final question is: how do I select each page in turn and pull the appropriate data from each one?

My preference would be to use only the Requests and bs4 modules, but if I'm going to use Selenium, how can I make Requests pull from the Selenium browser instance I have open?
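For what it's worth, one common way to let Requests reuse an open Selenium session is to copy the browser's cookies into a requests.Session. The helper below is a minimal sketch of that conversion (driver.get_cookies() does return a list of dicts with 'name' and 'value' keys); note, though, that if the dropdown state lives server-side in a postback rather than in a cookie, copying cookies alone may not reproduce the filtered page:

```python
def cookies_from_selenium(selenium_cookies):
    """Convert the list of cookie dicts returned by driver.get_cookies()
    into the plain {name: value} mapping that requests accepts."""
    return {c['name']: c['value'] for c in selenium_cookies}

# Usage sketch (assumes an open Selenium browser named 'browser'):
# import requests
# session = requests.Session()
# session.cookies.update(cookies_from_selenium(browser.get_cookies()))
# res = session.get('http://www.nerc.net/standardsreports/standardssummary.aspx')
```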

I've tried to be as thorough as I can, being new to Python, so any help would be much appreciated. Also, since I'm still learning a lot, explanations pitched at a beginner-to-intermediate level would be fantastic. Thanks!

========================================================

Thanks again for the help, it got me past the wall I was stuck on. Here is the final product... I had to add some sleep statements so that everything was fully loaded before grabbing the information.

Final revised version:

#! python3

# _nercTest.py - Opens the nerc.net website and pulls down all
# pdf's for the present, future, and inactive standards.

import os, requests, bs4, time, datetime
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

os.chdir('C:\\Standards')

def nercStandards(standardURL):
    logFile = open('_logFile.txt', 'w')
    logFile.write('Standard\t\tHyperlinks or Errors\t\t' +
                  str(datetime.datetime.now().strftime("%m-%d-%Y %H:%M:%S")) + '\n\n')
    logFile.close()
    fp = webdriver.FirefoxProfile('C:\\pathto\\Firefox\\Profiles\\seleniumDefault')
    browser = webdriver.Firefox(fp)
    wait = WebDriverWait(browser, 10)

    currentOption = 'Mandatory Standards Subject to Enforcement'
    futureOption = 'Standards Subject to Future Enforcement'
    inactiveOption = 'Inactive Reliability Standards'

    dropdownList = [currentOption, futureOption, inactiveOption]

    print()
    print(' STARTING STANDARDS DOWNLOAD '.center(80, '=') + '\n')
    for option in dropdownList:
        standardName = []   # Capture all the standard names accurately
        standardLink = []   # Capture all the href links for each standard
        standardDict = {}   # combine the standardName and standardLink into a dictionary 
        browser.get(standardURL)
        dropdown = Select(browser.find_element_by_id("ReportDropDown"))
        dropdown.select_by_visible_text(option)
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, 'div > span[class="style12"]'), option))

        time.sleep(3)   # Needed for the 'inactive' page to completely load consistently
        page_source = browser.page_source
        soup = bs4.BeautifulSoup(page_source, 'html.parser')
        soupElems = soup.select('.style97 a')

        # standardLink list generated here
        for link in range(len(soupElems)):
            standardLink.append(soupElems[link].get('href'))
            # http://www.nerc.com/pa/Stand/Reliability%20Standards/BAL-001-2.pdf

        # standardName list generated here
        if option == currentOption:
            print(' Mandatory Standards Subject to Enforcement '.center(80, '.') + '\n')
            currentElems = soup.select('.style99 span[class="style30"]')
            for currentStandard in range(len(currentElems)):
                standardName.append(currentElems[currentStandard].getText())
                # BAL-001-2
        elif option == futureOption:
            print()
            print(' Standards Subject to Future Enforcement '.center(80, '.') + '\n')
            futureElems = soup.select('.style99 span[class="style30"]')
            for futureStandard in range(len(futureElems)):
                standardName.append(futureElems[futureStandard].getText())
                # COM-001-3
        elif option == inactiveOption:
            print()
            print(' Inactive Reliability Standards '.center(80, '.') + '\n')
            inactiveElems = soup.select('.style104 font[face="Verdana"]')
            for inactiveStandard in range(len(inactiveElems)):
                standardName.append(inactiveElems[inactiveStandard].getText())
                # BAL-001-0

        # if number of names and links match, then create key:value pairs in standardDict
        if len(standardName) == len(standardLink):
            for x in range(len(standardName)):
                standardDict[standardName[x]] = standardLink[x]
        else:
            print('Error: items in standardName and standardLink are not equal!')
            logFile = open('_logFile.txt', 'a')
            logFile.write('\nError: items in standardName and standardLink are not equal!\n')
            logFile.close()

        # URL correction for PRC-005-1b
        # if 'PRC-005-1b' in standardDict:
        #     standardDict['PRC-005-1b'] = 'http://www.nerc.com/files/PRC-005-1.1b.pdf'

        for k, v in standardDict.items():
            logFile = open('_logFile.txt', 'a')
            ires = requests.get(v)
            try:
                ires.raise_for_status()
                logFile.write(k + '\t\t' + v + '\n')
            except Exception as exc:
                print('\nThere was a problem on %s: \n%s' % (k, exc))
                logFile.write('There was a problem on %s: \n%s\n' % (k, exc))
                logFile.close()
                continue    # skip writing a pdf for a failed download
            f = open(k + '.pdf', 'wb')
            for chunk in ires.iter_content(1000000):
                f.write(chunk)
            f.close()
            logFile.close()
            print(k + ': \n\t' + v)
    print()
    print(' STANDARDS DOWNLOAD COMPLETE '.center(80, '='))

nercStandards('http://www.nerc.net/standardsreports/standardssummary.aspx')
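As a side note on the final script: the name-to-link pairing can be written more compactly with zip while keeping the same length check. A small sketch (the sample name and URL below are illustrative):

```python
def pair_standards(names, links):
    """Map each standard name to its pdf link, refusing mismatched lists."""
    if len(names) != len(links):
        raise ValueError('names and links differ in length: %d vs %d'
                         % (len(names), len(links)))
    return dict(zip(names, links))
```

The standardDict block would then collapse to a single call, with the mismatch case handled by catching ValueError instead of an if/else.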

2 Answers:

Answer 0 (score: 1)

Once Selenium has done its work clicking buttons and so on, you need to hand the result over to BeautifulSoup:

    page_source = browser.page_source
    link_soup = bs4.BeautifulSoup(page_source,'html.parser')

Answer 1 (score: 1)

@HenryM is on the right track, except that before you read .page_source and pass it to BeautifulSoup for further parsing, you need to make sure the desired data is loaded. For that, use the WebDriverWait class.

For example, after you select the "Standards Filed and Pending Regulatory Approval" option, you need to wait for the report header to update - that will be an indication that the new results have loaded. Something along these lines:

from bs4 import BeautifulSoup

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select

# ...

wait = WebDriverWait(browser, 10)

option_text = "Standards Filed and Pending Regulatory Approval" 

# select the dropdown value
dropdown = Select(browser.find_element_by_id("ReportDropDown"))
dropdown.select_by_visible_text(option_text)

# wait for results to be loaded
wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, "#panel5 > div > span"), option_text))

soup = BeautifulSoup(browser.page_source,'html.parser')
# TODO: parse the results

Also note the use of the Select class to manipulate the dropdown.