I am submitting a FormRequest to change the page number of a multi-page result set. When I use scrapy shell, the POST request goes through:
```
2017-05-21 22:44:19 [scrapy.core.engine] INFO: Spider opened
2017-05-21 22:44:20 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.australianschoolsdirectory.com.au/robots.txt> (referer: None)
2017-05-21 22:44:22 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
True
2017-05-21 22:44:27 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
True
2017-05-21 22:44:39 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
True
2017-05-21 22:44:43 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
True
2017-05-21 22:44:46 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
True
```
using this request sequence:
```
>>> from scrapy.http import FormRequest
>>> url = 'http://www.australianschoolsdirectory.com.au/search-result.php'
>>> for i in range(1, 6):
...     payload = {'pageNum': str(i)}
...     r = FormRequest(url, formdata=payload)
...     fetch(r)
...     view(response)
```
But when I implement the POST request in my Scrapy spider, the POST is redirected back to the initial search page:
```
2017-05-21 22:58:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.australianschoolsdirectory.com.au/robots.txt> (referer: None)
2017-05-21 22:58:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.australianschoolsdirectory.com.au/search-result.php> (referer: None)
2017-05-21 22:58:46 [scrapy.core.engine] DEBUG: Crawled (200) <POST http://www.australianschoolsdirectory.com.au/search.php> (referer: http://www.australianschoolsdirectory.com.au/search-result.php)
```
Of course, search.php does not contain the data I'm looking for. Why does the POST in my spider get referred back to the search page, unlike in the shell? And how can I stop that referral while still advancing to the next set of results?
Scrapy code:
```
from scrapy.http import FormRequest
from scrapy.spiders import Spider

class Foo(Spider):
    name = "schoolsTest"
    allowed_domains = ["australianschoolsdirectory.com.au"]
    start_urls = ["http://www.australianschoolsdirectory.com.au/search-result.php"]

    def parse(self, response):
        yield FormRequest.from_response(response,
                                        formdata={'pageNum': str(5), 'search': 'true'},
                                        callback=self.parse1)

    def parse1(self, response):
        print(response.url)
```
Answer 0 (score: 1)
First, you don't need to use from_response (since you're not dealing with a form on the page); you can use Scrapy's start_requests method instead:
```
import scrapy

class Foo(scrapy.Spider):
    name = "schoolsTest"

    def start_requests(self):
        url = "http://www.australianschoolsdirectory.com.au/search-result.php"
        # range(1, 5) requests pages 1-4; raise the upper bound
        # (e.g. range(1, 489) for 488 pages) to crawl every result page
        for i in range(1, 5):
            payload = {'pageNum': str(i)}
            yield scrapy.FormRequest(url, formdata=payload)

    def parse(self, response):
        # Extract all links from the search page and make them absolute URLs
        links = response.xpath('//div[@class="listing-header"]/a/@href').extract()
        for link in links:
            full_url = response.urljoin(link)
            # Make a Request to each detail page
            yield scrapy.Request(full_url, callback=self.parse_detail)

    def parse_detail(self, response):
        print(response.url)
```