I tried to write this code using CrawlSpider, but the spider returns no results (it just opens and then closes):
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from torent.items import TorentItem


class MultiPagesSpider(CrawlSpider):
    name = 'job'
    allowed_domains = ['tanitjobs.com/']
    start_urls = ['http://tanitjobs.com/browse-by-category/Nurse/?searchId=1393459812.065&action=search&page=1&view=list',]
    rules = (
        Rule(SgmlLinkExtractor(allow=('page=*',), restrict_xpaths=('//div[@class="pageNavigation"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = hxs.select('//div[@class="offre"]/div[@class="detail"]')
        scraped_items = []
        for item in items:
            scraped_item = TorentItem()
            scraped_item["title"] = item.select('a/strong/text()').extract()
            scraped_items.append(scraped_item)
        return items
Answer 0 (score: 0)
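What @paul t. said in the comment above applies, but in addition you need to return scraped_items rather than items from parse_item. As written, the callback returns the selector objects instead of the populated items, so you will get a large number of errors like the following: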
2014-02-26 23:40:59+0000 [job] ERROR: Spider must return Request, BaseItem or None, got 'HtmlXPathSelector' in
<GET http://tanitjobs.com/browse-by-category/Nurse/?action=search&page=3&searchId=1393459812.065&view=list>
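
To make the fix concrete, here is a minimal sketch of the corrected loop. It is deliberately not Scrapy code: it uses the standard library's ElementTree as a stand-in for HtmlXPathSelector and a plain dict as a stand-in for TorentItem, so it runs without Scrapy installed. SAMPLE_PAGE is hypothetical markup that I made up to mirror the XPath in the question; it is not taken from the real site. The only point is the last line: return the list you filled, not the nodes you iterated over.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped like the XPath in the question (an assumption,
# not the real page source).
SAMPLE_PAGE = """
<html><body>
  <div class="offre"><div class="detail"><a href="/job/1"><strong>Nurse A</strong></a></div></div>
  <div class="offre"><div class="detail"><a href="/job/2"><strong>Nurse B</strong></a></div></div>
</body></html>
"""


def parse_item(page_source):
    root = ET.fromstring(page_source)
    # Stand-in for hxs.select('//div[@class="offre"]/div[@class="detail"]')
    nodes = root.findall(".//div[@class='offre']/div[@class='detail']")
    scraped_items = []
    for node in nodes:
        # Stand-in for TorentItem(); the title is kept as a list of strings,
        # similar to what .extract() yields in the original code.
        scraped_item = {"title": [strong.text for strong in node.findall("a/strong")]}
        scraped_items.append(scraped_item)
    # The fix: return the populated items, not the selector nodes.
    return scraped_items
```

In the actual spider the equivalent change is just the final line of parse_item: return scraped_items in place of return items.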