我一直在使用Scrapy,并尝试按照示例操作,只关注匹配某种正则表达式的url。
我不是Python开发人员,但我尝试了很多技术来尝试实现这一目标。
我在Scrapy文档中使用示例网址,并通过CrawlSpider
从LinkExtractor
和implmentingg规则扩展。
目前,我想对其中包含“朋友”一词的任何网址使用自定义解析器。
** Scrapy Python Spider **
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
name = 'example'
allowed_domains = ['quotes.toscrape.com']
start_urls = ['http://quotes.toscrape.com']
rules = [
Rule(LinkExtractor(allow='(friends)'), callback='parse_custom')
]
def parse(self, response):
self.logger.info('1111111111111 - Parsing General URL! %s', response.url)
for href in response.css('a::attr(href)'):
yield response.follow(href, callback=self.parse)
def parse_custom(self, response):
# I have never been able to get this to call
self.logger.info('2222222222222 - Parsing CUSTOM URL! %s', response.url)
for href in response.css('a::attr(href)'):
yield response.follow(href, callback=self.parse)
日志文件
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/miracles/page/1/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/miracle/page/1/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/live/page/1/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/life/page/1/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/inspirational/page/1/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/choices/page/1/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/abilities/page/1/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/simile/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/miracles/page/1/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/miracle/page/1/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/live/page/1/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/life/page/1/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/inspirational/page/1/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/choices/page/1/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/abilities/page/1/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/simile/
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/truth/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Marilyn-Monroe/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/friends/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/friendship/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/reading/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/books/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/tag/humor/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/author/Jane-Austen/> (referer: http://quotes.toscrape.com)
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/truth/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/author/Marilyn-Monroe/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/friends/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/friendship/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/reading/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/books/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/tag/humor/
2017-07-30 10:45:59 [example] INFO: 1111111111111 - Parsing General URL! http://quotes.toscrape.com/author/Jane-Austen/
答案 0 :(得分:2)
编写爬网蜘蛛规则时,请避免使用parse
作为回调,因为CrawlSpider
使用parse
方法本身来实现其逻辑。因此,如果您覆盖parse
方法,则抓取蜘蛛将不再有效。