How to send objects from one rule to another with Scrapy

Time: 2019-12-23 17:16:46

Tags: python scrapy rules

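I'm trying to scrape company ratings from Glassdoor, and at some point I need to send some objects from one rule to another.

This is the main search link: https://www.glassdoor.com/Reviews/lisbon-reviews-SRCH_IL.0,6_IM1121.htm

In the first rule I access this page and grab some information. Then I need to follow another link on that page to reach the reviews page; the link is matched by the XPath expression //a[@class='eiCell cell reviews '].

Here is the problem: how can I follow this link from inside parse_item, using the XPath expression, without losing the information I have already collected?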

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class GetComentsSpider(CrawlSpider):
    name = 'get_coments'
    allowed_domains = ['www.glassdoor.com']
    # The original URL started with a duplicated scheme ('http://https://')
    start_urls = ['https://www.glassdoor.com/Reviews/portugal-reviews-SRCH_IL.0,8_IN195.htm/']
    user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'
    download_delay = 0.1

    rules = (
        # Access the page, get the link from each company and move to parse_item
        Rule(LinkExtractor(restrict_xpaths="//div[@class=' margBotXs']/a"), callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths="//a[@class='eiCell cell reviews ']"), callback='parse_item', follow=True),

        # Pagination
        Rule(LinkExtractor(restrict_xpaths="//li[@class='next']/a"), follow=True),
    )

    def parse_item(self, response):
        # Get the company name and rating
        name = response.xpath("(//span[@class='updateBy'])[1]").get()
        rating = response.xpath("//span[@class='bigRating strong margRtSm h1']/text()").get()

        # Here I need to go to the link of //a[@class='eiCell cell reviews '] to get more data
        # without losing the name and rating

        yield {
            "Name": name,
            "Rating": rating,
        }

1 Answer:

Answer 0: (score: 0)

You can use Request(..., meta=...) to send the item to another parser.

(And you don't need a Rule to get the URL for this request.)

from scrapy import Request

# Both functions are methods of the spider class

def parse_item(self, response):
    name = response.xpath("(//span[@class='updateBy'])[1]").get()
    rating = response.xpath("//span[@class='bigRating strong margRtSm h1']/text()").get()

    item = {
        "Name": name,
        "Rating": rating,
    }

    url = ...  # URL of the //a[@class='eiCell cell reviews '] link that holds more data

    # Request expects a callable here, unlike Rule, which also accepts a method name as a string
    yield Request(url, callback=self.other_parser, meta={"item": item})

def other_parser(self, response):
    # Retrieve the item passed along from parse_item
    item = response.meta['item']

    item['other'] = ...  # add values to item

    yield item