Question

I have a CrawlSpider with a start_urls list:

    start_urls = [
        'http://www.tottus.cl/tottus/productListFragment/Conservas/118.8?No=0&Nrpp=&currentCatId=118.8',
        'http://www.tottus.cl/tottus/productListFragment/Alimento-Gato/9.2?No=0&Nrpp=&currentCatId=9.2'
    ]

My crawl rules:

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@id="next"]')), follow=True),
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//div[@class="title"]/a')), callback='parse_item', process_links='process_value'),
    )

I have a process_links function that receives the extracted links (the child URLs), applies some processing to them, and returns them:

    def process_value(self, value):
        # value is the list of Link objects extracted by the rule
        for link in value:
            # whatever processing is needed
            link.url = link.url + '&categoria=hola'
            yield link

All of these links are child URLs of the start_urls. The start_urls list holds the parent URLs, and the URLs extracted by the Rule are the child URLs.

My problem is that I need to pass the parent URL (the start_url) to the process_value function. I need something like this:

    def process_value(self, value):
        print(parent_url)  # the start_url; how do I pass this in from the crawl Rule?
        for link in value:
            print(link)  # child URL

How can I pass the parent URL (start_url) to the function used in the rule?


0 answers: