I was glad to discover Scrapy's CrawlSpider class and its Rule objects. However, when I try to use process_links to filter out URLs containing the word "login", it doesn't work. The solution I implemented comes from here: Example code for Scrapy process_links and process_request, but it does not exclude the pages I want.
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader

from accenture.items import AccentureItem


class AccentureSpiderSpider(CrawlSpider):
    name = 'accenture_spider'
    start_urls = ['https://www.accenture.com/us-en/internet-of-things-index']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[contains(@href, "insight")]'),
             callback='parse_item', process_links='process_links', follow=True),
    )

    def process_links(self, links):
        for link in links:
            if 'login' in link.text:
                continue  # skip all links that have "login" in their text
            yield link

    def parse_item(self, response):
        loader = ItemLoader(item=AccentureItem(), response=response)
        url = response.url
        loader.add_value('url', url)
        yield loader.load_item()
Answer 0 (score: 1)
My mistake was using link.text. When I switched to link.url, it worked perfectly :)
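The fix above can be sketched in isolation. Scrapy's Link objects expose both a `url` and a `text` attribute; the anchor text is often empty or unrelated to the URL, so filtering on `link.url` is what actually excludes the login pages. The stand-in `Link` namedtuple and the sample URLs below are only illustrations, not Scrapy itself:

```python
from collections import namedtuple

# Stand-in for scrapy.link.Link, which exposes .url and .text attributes.
Link = namedtuple('Link', ['url', 'text'])


def process_links(links):
    """Drop any link whose URL (not its anchor text) contains 'login'."""
    for link in links:
        if 'login' in link.url:
            continue  # anchor text may be empty, so filter on the URL instead
        yield link


# Hypothetical sample links for illustration.
links = [
    Link('https://www.accenture.com/us-en/login', ''),
    Link('https://www.accenture.com/us-en/insight-iot', 'IoT insights'),
]
kept = list(process_links(links))
print([link.url for link in kept])  # only the non-login URL survives
```

Checking `link.text` instead would have kept the first link, because its anchor text is empty even though its URL contains "login".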