Using InitSpider with Splash: only parsing the login page?

Date: 2016-01-24 17:11:10

Tags: python scrapy splash

This is a follow-up to one I asked earlier.

I am trying to scrape a page that I have to log in to before I can reach it. After authenticating, though, the page I need requires some JavaScript to run before the content is visible. What I did was follow the instructions here to install Splash and try to render the JavaScript. However...

Before I switched over to Splash, authentication with Scrapy's InitSpider was fine. I was getting through the login page and scraping the target page OK (except, obviously, that nothing requiring JavaScript worked). But once I added the code to pass the requests through Splash, it looks like I am no longer parsing the target page.

The spider is below. The only difference between the Splash version (here) and the non-Splash version is the function def start_requests(). Everything else is the same between the two.

import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor

class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
            "http://www.bridgebase.com/myhands/index.php"
            ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F" 

    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'USERNAME', 'password': 'PASSWORD'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)

    # pipe the requests through splash so the JS renders 
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            }) 

    # what to do when a link is encountered
    rules = (
            Rule(LinkExtractor(), callback='parse_item'),
            )

    # do nothing on new link for now
    def parse_item(self, response):
        pass

    def parse(self, response):
        filename = 'test.html' 
        with open(filename, 'wb') as f:
            f.write(response.body)

What happens now is that test.html, the result of parse(), is just the login page itself rather than the page I should be redirected to after logging in.

The log is telling: normally I would see the "Login successful" line from check_login_response(), but as you can see below it seems I am not even getting that far. Is this because scrapy is now also routing the authentication requests through Splash, and it is getting hung up there? If so, is there a way to bypass Splash for just the authentication part?

2016-01-24 14:54:56 [scrapy] INFO: Spider opened
2016-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2016-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)

I am fairly sure I am not using Splash correctly. Can anyone point me to some documentation where I can work out what is going on?

3 Answers:

Answer 0 (score: 5)

I don't think Splash alone would handle this particular case.

Here is a working idea: authenticate with selenium and a headless PhantomJS browser, then hand the session cookies over to a regular scrapy.Request.

Code:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)

        driver.find_element_by_id("username").send_keys("user")
        driver.find_element_by_id("password").send_keys("password")

        driver.find_element_by_name("submit").click()

        driver.save_screenshot("test.png")
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

        cookies = driver.get_cookies()
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
        else:
            self.log("Login failed")
        print(response.body)

This prints Login successful and the HTML of the "hands" page.
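One small note on the cookie hand-off (not part of the original answer): Selenium's get_cookies() returns a list of dicts with name, value, domain, path and so on, and Scrapy's cookies argument accepts either such a list or a plain {name: value} mapping. So, inside start_requests() above, the hand-off could also be written explicitly as:

        # Flatten the Selenium cookies into the simple dict form Scrapy also accepts.
        cookie_dict = {c['name']: c['value'] for c in driver.get_cookies()}
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php",
                             cookies=cookie_dict)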

Answer 1 (score: 0)

Update

So, it seems that start_requests() fires before login.

Here is the code for InitSpider, minus the comments.

class InitSpider(Spider):
    def start_requests(self):
        self._postinit_reqs = super(InitSpider, self).start_requests()
        return iterate_spider_output(self.init_request())

    def initialized(self, response=None):
        return self.__dict__.pop('_postinit_reqs')

    def init_request(self):
        return self.initialized()

InitSpider triggers the main start requests through initialized(): start_requests() stashes them in _postinit_reqs and runs init_request() first.

Your start_requests() is a modified version of the base class method, so maybe something like this will work.

from scrapy.utils.spider import iterate_spider_output

...

def start_requests(self):
    self._postinit_reqs = self.my_start_requests()
    return iterate_spider_output(self.init_request())

def my_start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, self.parse, meta={
            'splash': {
                'endpoint': 'render.html',
                'args': {'wait': 0.5}
            }
        }) 

You need to return self.initialized() (in check_login_response()) so that the stashed requests actually get scheduled.
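Putting that together with the spider from the question, an untested sketch of the whole thing might look like this (the splash meta and the auth callbacks are copied from the question; the only new piece is routing my_start_requests() into _postinit_reqs):

import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.utils.spider import iterate_spider_output


class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = ["http://www.bridgebase.com/myhands/index.php"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        # Mirror InitSpider.start_requests(): stash the real (Splash) requests
        # and run the login request first.
        self._postinit_reqs = self.my_start_requests()
        return iterate_spider_output(self.init_request())

    def my_start_requests(self):
        # Only these post-login requests are routed through Splash.
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {'endpoint': 'render.html', 'args': {'wait': 0.5}}
            })

    def init_request(self):
        return scrapy.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'USERNAME', 'password': 'PASSWORD'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
            # This releases the stashed Splash requests for crawling.
            return self.initialized()
        self.log("Login failed")

    def parse(self, response):
        with open('test.html', 'wb') as f:
            f.write(response.body)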

Answer 2 (score: 0)

You can get all the data without the need for js at all, there are links available for browsers that do not have javascript enabled, the urls are the same bar ?offset=0. You just need to parse the queries from the tourney url you are interested in and create a FormRequest.

import scrapy
from scrapy.spiders.init import InitSpider
from urlparse import parse_qs, urlparse


class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
            "http://www.bridgebase.com/myhands/index.php"
            ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        return [scrapy.FormRequest(self.login_page,
                                   formdata={'username': 'foo', 'password': 'bar'},
                                   callback=self.parse)]

    def parse(self, response):
        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php?offset=0",
                             callback=self.get_all_tournaments)

    def get_all_tournaments(self, r):
        url = r.xpath("//a/@href[contains(., 'tourneyhistory')]").extract_first()
        yield scrapy.Request(url, callback=self.chosen_tourney)

    def chosen_tourney(self, r):
        url = r.xpath("//a[contains(./text(),'Speedball')]/@href").extract_first()
        query = urlparse(url).query
        yield scrapy.FormRequest("http://webutil.bridgebase.com/v2/tarchive.php?offset=0",
                                 callback=self.get_tourney_data_links,
                                 formdata={k: v[0] for k, v in parse_qs(query).items()})

    def get_tourney_data_links(self, r):
        print r.xpath("//a/@href").extract()

There are numerous links in the output. For the hands you get tview.php?-t=.... links; you can request each one, joining to http://webutil.bridgebase.com/v2/, and it will give you a table of all the data that is easy to parse. There are also links associated with each hand in the tables. A snippet of the output from the tview link:

tourney=4796-1455303720-&username=...

The rest of the parsing I will leave to yourself.
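For completeness, a rough, untested sketch of what that remaining parsing could look like, extending the spider above (parse_hand and the "cells" field are made-up names, and it assumes the tview output really is a plain HTML table as described):

from urlparse import urljoin  # Python 2, as in the spider above

def get_tourney_data_links(self, r):
    # Follow every link from the archive page instead of just printing them.
    for href in r.xpath("//a/@href").extract():
        yield scrapy.Request(urljoin(r.url, href), callback=self.parse_hand)

def parse_hand(self, response):
    # Hypothetical: pull each row of the data table out as a simple item.
    for row in response.xpath("//table//tr"):
        cells = row.xpath("./td//text()").extract()
        if cells:
            yield {"cells": cells}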