I want to use Scrapy to retrieve some data from a website that has a table spread across multiple pages. Here is what the spider looks like:
from scrapy.spiders import Spider
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector

# Contact is this project's Item subclass (defined in items.py)

class ItsyBitsy(Spider):
    name = "itsybitsy"
    allowed_domains = ["mywebsite.com"]
    start_urls = ["http://mywebsite.com/Default.aspx"]

    def parse(self, response):
        # Performs authentication to get past the login form
        return [FormRequest.from_response(response,
            formdata={'tb_Username': 'admin', 'tb_Password': 'password'},
            callback=self.after_login,
            clickdata={'id': 'b_Login'})]

    def after_login(self, response):
        # Session authenticated. Request the Subscriber List page
        yield Request("http://mywebsite.com/List.aspx",
                      callback=self.listpage)

    def listpage(self, response):
        # Parses the entries on the page, and stores them
        sel = Selector(response)
        entries = sel.xpath("//table[@id='gv_Subsribers']").css("tr")
        items = []
        for entry in entries:
            item = Contact()
            # XPath positions are 1-based; use a relative path from the row
            item['name'] = entry.xpath('./td[1]/text()').extract_first()
            items.append(item)
        # I want to request the next page, but store these results FIRST
        self.getNext10()
        return items
I'm stuck on the last line. I want to request the next page (so I can scrape the next 10 rows of data), but I also want to save the data I've already collected using the feed exporter (configured in my settings.py).

How can I tell the feed exporter to save the data without calling return items, which would stop me from going on to scrape the next 10 rows?
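For context, a feed exporter is enabled with a couple of settings in settings.py; the format and output path below are only example values, not taken from the question:

```python
# settings.py (example values; adjust the format and path for your project)
FEED_FORMAT = 'csv'
FEED_URI = 'subscribers.csv'
```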
Answer 0 (score: 1)
Answer: use a generator.
def listpage(self, response):
    # Parses the entries on the page, yielding them one at a time
    sel = Selector(response)
    entries = sel.xpath("//table[@id='gv_Subsribers']").css("tr")
    for entry in entries:
        item = Contact()
        item['name'] = entry.xpath('./td[1]/text()').extract_first()
        yield item
    # remember: getNext10 has to return a Request with callback=self.listpage
    yield self.getNext10()
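The key point is that a generator hands each item to Scrapy as soon as it is yielded, before the function finishes, so the feed exporter receives the current page's rows while the next-page Request is still pending. A minimal plain-Python sketch of the same control flow (the page data here is made up):

```python
def scrape(pages):
    """Yield every row of each page before moving on to the next page."""
    for page in pages:
        for row in page:
            yield row  # the consumer sees this row immediately
        # ...a real spider would yield the next-page Request here...

rows = list(scrape([["a", "b"], ["c", "d"]]))
print(rows)  # ['a', 'b', 'c', 'd']
```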