Scrapy Rules: excluding certain URLs with process_links

Date: 2019-06-26 14:57:19

Tags: python-3.x web-scraping scrapy

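I was very happy to discover Scrapy's CrawlSpider class and its Rule objects. However, when I try to use process_links to filter out URLs that contain the word "login", it doesn't work. The solution I implemented comes from here: Example code for Scrapy process_links and process_request, but it doesn't exclude the pages I want.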

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from accenture.items import AccentureItem

class AccentureSpiderSpider(CrawlSpider):
    name = 'accenture_spider'
    start_urls = ['https://www.accenture.com/us-en/internet-of-things-index']

    rules = (
        Rule(
            LinkExtractor(restrict_xpaths='//a[contains(@href, "insight")]'),
            callback='parse_item',
            process_links='process_links',
            follow=True,
        ),
    )

    def process_links(self, links):
        for link in links:
            if 'login' in link.text:
                continue  # skip all links that have "login" in their text
            yield link 

    def parse_item(self, response):
        loader = ItemLoader(item=AccentureItem(), response=response)
        url = response.url
        loader.add_value('url', url)
        yield loader.load_item()

1 answer:

Answer 0 (score: 1)

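My mistake was using link.text, which holds the anchor text of the link rather than its address. Once I switched to link.url, it worked fine :)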
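
For illustration, here is a minimal sketch of the corrected filter. It assumes only that each link object has a `url` attribute (the resolved href) and a `text` attribute (the visible anchor text), as Scrapy's `Link` objects do. `FakeLink`, `drop_login_links`, and the example.com URLs are hypothetical names made up so the snippet runs without Scrapy installed.

```python
from collections import namedtuple

# Hypothetical stand-in for scrapy.link.Link: `url` is the link target,
# `text` is the anchor text shown on the page.
FakeLink = namedtuple("FakeLink", ["url", "text"])


def drop_login_links(links):
    """Yield only the links whose URL does not contain 'login'."""
    for link in links:
        if "login" in link.url:
            continue  # skip links that point at a login page
        yield link


# Made-up sample data: the second link targets a login page, but its
# anchor text ("Sign in") never mentions the word "login".
links = [
    FakeLink(url="https://example.com/insights/iot", text="IoT insights"),
    FakeLink(url="https://example.com/login?next=/insights", text="Sign in"),
]

kept = [link.url for link in drop_login_links(links)]
print(kept)
```

In the spider from the question, the same loop would be the body of `process_links(self, links)`. The sample data also suggests why the `link.text` check let the pages through: a login link's anchor text does not have to contain the word "login". If the filtering is purely URL-based, the `deny` argument of `LinkExtractor` (a regular expression, or a list of them) may be a simpler alternative, though that was not tested here.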