Custom parse callback for requests in Scrapy

Time: 2016-07-14 14:31:14

Tags: python scrapy

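I am trying to collect the item URLs in the parse_start_url method, which yields a Request for each one with parse_link as its callback, but the callback never seems to run. What am I doing wrong?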

Code:

from scrapy import Request
from scrapy.selector import Selector 
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from property.items import PropertyItem
import sys

reload(sys)
sys.setdefaultencoding('utf8')  #To prevent UnicodeDecodeError, UnicodeEncodeError.

class VivastreetSpider(CrawlSpider):
    name = 'viva'
    allowed_domains = ['chennai.vivastreet.co.in']
    start_urls = ['http://chennai.vivastreet.co.in/rent+chennai/']
    rules = [
        Rule(LinkExtractor(restrict_xpaths = '//*[text()[contains(., "Next")]]'), callback = 'parse_start_url', follow = True)
        ]   

    def parse_start_url(self, response):
        urls = Selector(response).xpath('//a[contains(@id, "vs-detail-link")]/@href').extract() 

        for url in urls:
            print('test ' + url)
            yield Request(url = url, callback = self.parse_link)

    def parse_link(self, response):
        #item = PropertyItem()
        print('parseitemcalled')
        a = Selector(response).xpath('//*h1[@class = "kiwii-font-xlarge kiwii-margin-none"').extract()
        print('test ' + str(a))

1 answer:

Answer 0 (score: 0)

You need to adjust allowed_domains so that the extracted URLs are allowed to be followed:

allowed_domains = ['vivastreet.co.in']
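
To see why the broader domain matters, here is a minimal sketch of the kind of host check that offsite filtering performs. It is a simplified stand-in written for Python 3, not Scrapy's actual code. The is_allowed helper and the detail-page URL are hypothetical; the real host of the extracted links is an assumption.

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    # An optional "<anything>." prefix followed by exactly one of the allowed domains.
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


# Hypothetical detail-page URL (assumption: detail links live on another subdomain).
detail_url = "http://www.vivastreet.co.in/some-listing/123"

print(is_allowed(detail_url, ["chennai.vivastreet.co.in"]))  # setting from the question
print(is_allowed(detail_url, ["vivastreet.co.in"]))          # setting from the answer
```

Under that assumption, the first check rejects the URL and the second accepts it. A request rejected this way is dropped before its callback is ever called, which would match the symptom described in the question.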

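After that, you will run into an invalid expression error, because //*h1[@class = "kiwii-font-xlarge kiwii-margin-none" is not valid XPath and needs to be fixed: *h1 is not a valid step, and the closing ] of the predicate is missing.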
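
One possible correction is //h1[@class="kiwii-font-xlarge kiwii-margin-none"]. The sketch below checks an equivalent expression with the standard library's ElementTree (Python 3) instead of Scrapy, so it runs without a crawl. The page fragment, its heading text and the extract_titles helper are made up for illustration. ElementTree only accepts the relative .// form, whereas in the spider the leading // form would be passed to the selector's xpath() method.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment standing in for a detail page.
PAGE = """
<html>
  <body>
    <h1 class="kiwii-font-xlarge kiwii-margin-none">2 BHK flat for rent</h1>
    <h1 class="other">Unrelated heading</h1>
  </body>
</html>
""".strip()

# Corrected expression: no stray '*' before h1, and the predicate is closed with ']'.
TITLE_XPATH = ".//h1[@class='kiwii-font-xlarge kiwii-margin-none']"


def extract_titles(markup):
    """Return the text of every h1 whose class attribute matches exactly."""
    root = ET.fromstring(markup)
    return [h1.text for h1 in root.findall(TITLE_XPATH)]


print(extract_titles(PAGE))  # ['2 BHK flat for rent']
```

The predicate compares the whole class attribute as a string, so it only matches headings whose class is exactly "kiwii-font-xlarge kiwii-margin-none".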