I'm new to Scrapy and Python.
I want to scrape a property price register site that uses a query-based search. Most of the examples I've seen deal with plain web pages, not searches that go through the FormRequest mechanism. The code I've written is below. At the moment everything is hard-coded; I'd like to be able to crawl the database by year or by county.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector


class SecondSpider(CrawlSpider):
    name = "second"

    '''
    def start_requests(self):
        return [scrapy.FormRequest("https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm",
                                   # this is the form; it asks for the fields below,
                                   # then the link changes to this form:
                                   # https://www.propertypriceregister.ie/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView
                                   #     &Start=1
                                   #     &SearchMax=0
                                   #     &SearchOrder=4
                                   #     &Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011
                                   #     &County=      # these are the query fields
                                   #     &Year=2010    # these are the query fields
                                   #     &StartMonth=  # these are the query fields
                                   #     &EndMonth=    # these are the query fields
                                   #     &Address=     # these are the query fields
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
    '''

    allowed_domains = ["www.propertypriceregister.ie"]
    start_urls = ('https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm',)

    rules = (
        Rule(SgmlLinkExtractor(allow='/website/npsra/PPR/npsra-ppr.nsf/PPR-By-Date?SearchView&Start=1&SearchMax=0&SearchOrder=4&Query=%5Bdt_execution_date%5D%3E=01/01/2010%20AND%20%5Bdt_execution_date%5D%3C01/01/2011&County=&Year=2010&StartMonth=&EndMonth=&Address='),
             callback='parse',
             follow=True),
    )

    def parse(self, response):
        print(response)
Answer 0 (score: 1)
Before you begin, re-read how Rule objects work. Right now, your rule will only ever match one very specific URL, which the site will never present as a link (because it is the format of a form POST).
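To make that concrete, here is a minimal sketch of how a Rule is normally used (untested; the spider name and the parse_item callback are placeholders, and the /eStampUNID/UNID- pattern is borrowed from the code further down). The allow argument is a regular expression matched against the hrefs actually found in each fetched page, so it should describe a pattern of links the site really emits, not a hand-built query string:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class RuleDemoSpider(CrawlSpider):
    name = 'rule_demo'
    allowed_domains = ['propertypriceregister.ie']
    start_urls = ['https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm']

    rules = (
        # `allow` is a regex tested against each href discovered in a page;
        # every matching link is scheduled as a new request automatically.
        Rule(LinkExtractor(allow=('/eStampUNID/UNID-',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # note the callback name: anything but 'parse' (more on that below)
        self.logger.info('Visiting %s', response.url)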
Next, don't override CrawlSpider's parse method (in fact, don't use it at all). It is used internally by CrawlSpider (see the warning in the documentation linked above for more details).
You'll need to generate a FormRequest for each of the searches you want to run, similar to this (note: untested, but it should work):
import itertools
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SecondSpider(CrawlSpider):
    name = 'second'
    allowed_domains = ['propertypriceregister.ie', 'www.propertypriceregister.ie']

    rules = (
        Rule(LinkExtractor(allow=('/eStampUNID/UNID-',)), callback='parse_search'),
    )

    def start_requests(self):
        years = [2010, 2011, 2012, 2013, 2014]
        counties = ['County1', 'County2']  # fill in the counties you need
        for county, year in itertools.product(counties, years):
            yield scrapy.FormRequest(
                'https://www.propertypriceregister.ie/website/npsra/pprweb.nsf/PPR?OpenForm',
                # form field values must be strings, hence str(year)
                formdata={'County': county, 'Year': str(year)},
                dont_filter=True)

    def parse_search(self, response):
        # Parse response here
        pass
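One detail worth flagging in the code above: the values in formdata must be strings (Scrapy URL-encodes them into the POST body and will raise an error for integers), which is why the year is passed as str(year) rather than as a plain int.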
From this point on, your rules will be applied to every page you get back from the FormRequests, to extract URLs from them. If you want to grab anything from those initial pages as well, override CrawlSpider's parse_start_url method.
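If it helps, here is a minimal sketch of that override, added to the spider above (the method name parse_start_url is part of the CrawlSpider API; the selector and the yielded fields are hypothetical placeholders, not the real page structure):

    def parse_start_url(self, response):
        # CrawlSpider routes the responses for your start requests through
        # this hook before applying the rules, so you can scrape the first
        # results page itself here.
        for row in response.css('table tr'):  # hypothetical selector
            yield {'raw_row': row.get()}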