如何让scrapy表单提交工作

时间:2015-01-20 06:22:39

标签: python forms web-scraping scrapy scrapy-spider

我尝试使用scrapy来完成登录并收集我的项目提交计数。这是代码。

from scrapy.item import Item, Field
from scrapy.http import FormRequest
from scrapy.spider import Spider
from scrapy.utils.response import open_in_browser


class GitSpider(Spider):
    name = "github"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def parse(self, response):
        formdata = {'login': 'username',
                'password': 'password' }
        yield FormRequest.from_response(response,
                                        formdata=formdata,
                                        clickdata={'name': 'commit'},
                                        callback=self.parse1)

    def parse1(self, response):
        open_in_browser(response)

运行代码后

scrapy runspider github.py

它应该显示表单的结果页面,该页面应该是用户名和密码为假的同一页面中的登录失败。但是它显示了search page。日志文件位于pastebin

如何修复代码?提前谢谢。

3 个答案:

答案 0 :(得分:9)

您的问题是FormRequest.from_response()使用了不同的表单 - 搜索表单"。但是,您希望它使用"登录表单"代替。提供formnumber参数:

yield FormRequest.from_response(response,
                                formnumber=1,
                                formdata=formdata,
                                clickdata={'name': 'commit'},
                                callback=self.parse1)

以下是我在应用更改后在浏览器中看到的内容(使用"假"用户):

enter image description here

答案 1 :(得分:3)

使用webdriver的解决方案。

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
from scrapy.contrib.spiders import CrawlSpider

class GitSpider(CrawlSpider):

    name = "gitscrape"
    allowed_domains = ["github.com"]
    start_urls = ["https://www.github.com/login"]

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        login_form = self.driver.find_element_by_name('login')
        password_form = self.driver.find_element_by_name('password')
        commit = self.driver.find_element_by_name('commit')
        login_form.send_keys("yourlogin")
        password_form.send_keys("yourpassword")
        actions = ActionChains(self.driver)
        actions.click(commit)
        actions.perform()
        # by this point you are logged to github and have access 
        #to all data in the main menù
        time.sleep(3)
        self.driver.close()

答案 2 :(得分:0)

使用“ formname”参数也可以:

yield FormRequest.from_response(response,
                            formname='Login',
                            formdata=formdata,
                            clickdata={'name': 'commit'},
                            callback=self.parse1)