Scrapy - POST request goes to the referring URL instead of the initial one

Date: 2017-05-21 13:46:13

Tags: python scrapy web-crawler

I am submitting a FormRequest to change the page number of a multi-page result set.

When I use scrapy shell, the POST request goes through:

    2017-05-21 22:44:19 [scrapy.core.engine] INFO: Spider opened
    2017-05-21 22:44:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.australianschoolsdirectory.com.au/robots.txt> (referer: None)
    2017-05-21 22:44:22 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
    True
    2017-05-21 22:44:27 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
    True
    2017-05-21 22:44:39 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
    True
    2017-05-21 22:44:43 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
    True
    2017-05-21 22:44:46 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
    True

using this sequence of requests:

>>> from scrapy.http import FormRequest
>>> url = 'http://www.australianschoolsdirectory.com.au/search-result.php'
>>> for i in range(1, 6):
...     payload={'pageNum': str(i)}
...     r = FormRequest(url, formdata=payload)
...     fetch(r)
...     view(response)
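
For reference, `FormRequest` urlencodes the `formdata` dict into the request's POST body, so the five shell requests above hit the same URL and differ only in that body. A minimal stdlib sketch of the bodies being sent (no scrapy required):

```python
from urllib.parse import urlencode

# The urlencoded POST body that FormRequest builds from formdata, one per page
for i in range(1, 6):
    body = urlencode({'pageNum': str(i)})
    print(body)  # pageNum=1 ... pageNum=5
```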

But when I implement the POST request in my scrapy code, the POST gets referred back to the initial search page:

    2017-05-21 22:58:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.australianschoolsdirectory.com.au/robots.txt> (referer: None)
    2017-05-21 22:58:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
    2017-05-21 22:58:46 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search.php> (referer: http://www.australianschoolsdirectory.com.au/search-result.php)

Of course search.php doesn't have the data I'm looking for. Why does the POST in my code get referred back to the search page, unlike in the shell? And how do I stop that referral while still advancing to the next set of results? Scrapy code:

from scrapy.http import FormRequest
from scrapy.spiders import Spider

class Foo(Spider):
    name = "schoolsTest"
    allowed_domains = ["australianschoolsdirectory.com.au"]
    start_urls = ["http://www.australianschoolsdirectory.com.au/search-result.php"]

    def parse(self, response):
        yield FormRequest.from_response(response, formdata={'pageNum': str(5), 'search': 'true'}, callback=self.parse1)

    def parse1(self, response):
        print(response.url)

1 Answer:

Answer 0 (score: 1):

First, you don't need `from_response` here: it submits a form found in the response page, which is why your POST went to that form's `action` (search.php) rather than the URL you intended. Since you aren't actually filling in a form, you can issue the POSTs directly from scrapy's `start_requests` method:

import scrapy

class Foo(scrapy.Spider):
    name = "schoolsTest"

    def start_requests(self):
        url = "http://www.australianschoolsdirectory.com.au/search-result.php"
        # range() excludes the end value, so use range(1, 489) to crawl all 488 result pages
        for i in range(1, 5):
            payload = {'pageNum': str(i)}
            yield scrapy.FormRequest(url, formdata=payload)

    def parse(self, response):
        # Extract all links from search page and make absolute urls
        links = response.xpath('//div[@class="listing-header"]/a/@href').extract()
        for link in links:
            full_url = response.urljoin(link)
            # Make a Request to each detail page
            yield scrapy.Request(full_url, callback=self.parse_detail)

    def parse_detail(self, response):
        print(response.url)
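
To see why the original spider landed on search.php: when no `formname`, `formid`, `formnumber`, or `formxpath` argument is given, `FormRequest.from_response` picks the first `<form>` in the response and submits to that form's `action`, overriding the URL you passed. A stdlib sketch of that lookup, using a hypothetical fragment of the search page's markup:

```python
from html.parser import HTMLParser

class FirstFormAction(HTMLParser):
    """Records the action attribute of the first <form> tag, mimicking
    which form FormRequest.from_response targets by default."""
    def __init__(self):
        super().__init__()
        self.action = None

    def handle_starttag(self, tag, attrs):
        if tag == 'form' and self.action is None:
            self.action = dict(attrs).get('action')

# Hypothetical fragment of search-result.php's HTML
html = '<form action="search.php" method="post"><input name="pageNum"></form>'
parser = FirstFormAction()
parser.feed(html)
print(parser.action)  # search.php
```

If you really did need `from_response`, those selector arguments would let you target the right form; here a plain `FormRequest` is simpler because the target URL is already known.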