Using Selenium with Scrapy to process web pages that require authentication

Date: 2015-02-09 21:50:16

Tags: python selenium scrapy

I am trying to scrape data from a page that uses a lot of AJAX calls and JavaScript execution to render the web page, so I am trying to use Selenium to do it. The modus operandi is as follows:

  1. Add the login page URL to the scrapy start_urls list

  2. Use FormRequest in the response callback to post the username and password and get authenticated.

  3. Once logged in, request the page I want to scrape
  4. Pass this response to the Selenium webdriver so it can click buttons on the page.
  5. After the buttons are clicked and the new web page is rendered, capture the results.
  6. The code I have so far is:

    from scrapy.spider import BaseSpider
    from scrapy.http import FormRequest, Request
    from selenium import webdriver
    import time


    class LoginSpider(BaseSpider):
        name = "sel_spid"
        start_urls = ["http://www.example.com/login.aspx"]

        def __init__(self):
            self.driver = webdriver.Firefox()

        def parse(self, response):
            # submit the login form with my credentials
            return FormRequest.from_response(response,
                formdata={'User': 'username', 'Pass': 'password'},
                callback=self.check_login_response)

        def check_login_response(self, response):
            if "Log Out" in response.body:
                self.log("Successfully logged in")
                scrape_url = "http://www.example.com/authen_handler.aspx?SearchString=DWT+%3E%3d+500"
                yield Request(url=scrape_url, callback=self.parse_page)
            else:
                self.log("Bad credentials")

        def parse_page(self, response):
            # step 4: hand the url over to selenium and click through the page
            self.driver.get(response.url)
            next = self.driver.find_element_by_class_name('dxWeb_pNext')
            next.click()
            time.sleep(2)
            # step 5 (not implemented yet): capture the html and store it in a file
    

The two roadblocks I have hit so far are:

1. Step 4 does not work. Whenever selenium opens the firefox window it is always at the login screen, and I don't know how to get past it.

2. I don't know how to achieve step 5.

Any help is greatly appreciated.

2 Answers:

Answer 0 (score: 2)

I don't believe you can switch between scrapy Requests and selenium like that. You need to log into the site with selenium instead of yielding a Request(); the login session you create with scrapy does not carry over into the selenium session. Here is an example (the element ids/xpaths will be different for you):

    scrape_url = "http://www.example.com/authen_handler.aspx"
    self.driver.get(scrape_url)
    time.sleep(2)
    # fill in the login form and submit it
    username = self.driver.find_element_by_id("User")
    password = self.driver.find_element_by_name("Pass")
    username.send_keys("your_username")
    password.send_keys("your_password")
    self.driver.find_element_by_xpath("//input[@name='commit']").click()

Then you can do:

    time.sleep(2)
    self.driver.find_element_by_class_name('dxWeb_pNext').click()
    time.sleep(2)

And so on.
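For step 5, once the new page has rendered you can pull the HTML out of the driver and hand it back to scrapy's selectors. A minimal sketch (untested; the table xpath is just a placeholder):

    from scrapy.http import HtmlResponse

    # wrap the rendered DOM in an HtmlResponse so scrapy selectors work on it
    rendered = HtmlResponse(url=self.driver.current_url,
                            body=self.driver.page_source,
                            encoding='utf-8')
    for row in rendered.xpath("//table//tr"):  # placeholder xpath, site-specific
        # pull out whatever fields you need, or write rendered.body to a file
        pass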

EDIT: If you need to render the javascript and are worried about speed/non-blocking, you can use http://splash.readthedocs.org/en/latest/index.html, which should do the trick.

http://splash.readthedocs.org/en/latest/scripting-ref.html#splash-add-cookie has details on passing a cookie; you should be able to pass it over from scrapy, but I haven't done it before.
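To illustrate, here is a rough sketch (untested) of handing a session cookie to Splash through its /execute endpoint; the cookie name ASP.NET_SessionId and a local Splash instance on port 8050 are assumptions on my part:

    import requests

    # Lua script run by Splash: set the cookie, load the page, return the rendered html
    lua_script = """
    function main(splash)
        splash:add_cookie{splash.args.cookie_name, splash.args.cookie_value,
                          path="/", domain="www.example.com"}
        assert(splash:go(splash.args.url))
        splash:wait(2)
        return splash:html()
    end
    """

    resp = requests.post("http://localhost:8050/execute", json={
        "lua_source": lua_script,
        "url": "http://www.example.com/authen_handler.aspx",
        "cookie_name": "ASP.NET_SessionId",  # assumed cookie name for an .aspx site
        "cookie_value": "value-from-scrapy-set-cookie-header",
    })
    html = resp.text  # the rendered page after javascript execution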

Answer 1 (score: 0)

First, log in with the scrapy api

    # submit the scrapy post request, with browse_files as the callback
    return FormRequest.from_response(
        response,
        # formxpath=formxpath,
        formdata=formdata,
        callback=self.browse_files
    )

Then pass the session on to the selenium chrome driver

    # logged in previously with the scrapy api
    def browse_files(self, response):
        print "browse files for: %s" % (response.url)

        # the session cookies scrapy received at login
        cookie_list2 = response.headers.getlist('Set-Cookie')
        print cookie_list2

        self.driver.get(response.url)
        self.driver.delete_all_cookies()

        # extract all the cookies and replay them into the selenium session
        for cookie2 in cookie_list2:
            cookies = map(lambda e: e.strip(), cookie2.split(";"))

            for cookie in cookies:
                splitted = cookie.split("=")
                if len(splitted) == 2:
                    name = splitted[0]
                    value = splitted[1]
                    # for my particular use case I needed only these values
                    if name == 'csrftoken' or name == 'sessionid':
                        cookie_map = {"name": name, "value": value}
                    else:
                        continue
                elif len(splitted) == 1:
                    cookie_map = {"name": splitted[0], "value": ''}
                else:
                    continue

                print "adding cookie"
                print cookie_map
                self.driver.add_cookie(cookie_map)

        # reload the page now that the session cookies are in place
        self.driver.get(response.url)

        # check if we have successfully logged in
        # (By comes from selenium.webdriver.common.by)
        files = self.wait_for_elements_to_be_present(By.XPATH, "//*[@id='files']", response)
        print files
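The snippet calls a wait_for_elements_to_be_present helper that is not shown. A plausible implementation (my assumption, not the answerer's code) using selenium's explicit waits:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def wait_for_elements_to_be_present(self, by, expression, response, timeout=10):
        # block until at least one matching element appears, then return all matches
        # (response is unused here; kept only to match the call above)
        WebDriverWait(self.driver, timeout).until(
            EC.presence_of_element_located((by, expression)))
        return self.driver.find_elements(by, expression)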