I want to collect the text of the links that Scrapy "clicks" while crawling a website.
Consider the following example:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DnsDbSpider(CrawlSpider):
    name = 'dns_db'
    allowed_domains = ['www.iana.org']
    start_urls = ['http://www.iana.org/']

    rules = (
        Rule(LinkExtractor(
                allow_domains='www.iana.org',
                restrict_css=r'#home-panel-domains > h2'),
             callback='parse_item',
             follow=True),
        Rule(LinkExtractor(
                allow_domains='www.iana.org',
                restrict_css=r'#main_right > p:nth-child(3)'),
             callback='parse_item',
             follow=True),
        Rule(LinkExtractor(
                allow_domains='www.iana.org',
                restrict_css=r'#main_right > ul:nth-child(4) > li'),
             callback='parse_item',
             follow=True),
    )

    def parse_item(self, response):
        self.logger.info('## Parsing URL: %s', response.url)
        i = {}
        return i
Scrapy log:
$ scrapy crawl dns_db 2>&1 | grep 'Parsing URL'
2017-01-17 22:14:01 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root/db
In this case, Scrapy effectively did the following:
path = []
path = ['Domain Names']
path = ['Domain Names', 'The DNS Root Zone']
path = ['Domain Names', 'The DNS Root Zone', 'Root Zone Database']
Just by looking at this path/list, a human could navigate the website.
How can I achieve this?
EDIT:
Here is a working example:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class DnsDbSpider(scrapy.Spider):
    name = "dns_db"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']

    def parse(self, response):
        if 'req_path' not in response.meta:
            response.meta['req_path'] = []
        self.logger.warn('## Request path: %s', response.meta['req_path'])
        restrict_css = (
            r'#home-panel-domains > h2',
            r'#main_right > p:nth-child(3)',
            r'#main_right > ul:nth-child(4) > li',
        )
        links = [link for css in restrict_css for link in self.links(response, css)]
        for link in links:
            #self.logger.info('## Link: %s', link)
            request = scrapy.Request(
                url=link.url,
                callback=self.parse)
            request.meta['req_path'] = response.meta['req_path'].copy()
            request.meta['req_path'].append(dict(text=link.text, url=link.url))
            yield request

    def links(self, response, restrict_css=None):
        lex = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_css=restrict_css)
        return lex.extract_links(response)
Command-line output:
$ scrapy crawl -L WARN dns_db
2017-02-12 00:13:50 [dns_db] WARNING: ## Request path: []
2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}]
2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}]
2017-02-12 00:13:52 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}, {'text': 'Root Zone Database', 'url': 'http://www.iana.org/domains/root/db/'}]
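For reference, a minimal sketch of how the accumulated req_path could also be emitted as an item once a page yields no further matching links. The spider name and the "no matching links means leaf page" condition are assumptions for illustration, not part of the spider above:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor


class DnsDbPathSpider(scrapy.Spider):
    # hypothetical variant of the spider above, name chosen for illustration
    name = "dns_db_path"
    allowed_domains = ["www.iana.org"]
    start_urls = ['http://www.iana.org/']

    restrict_css = (
        '#home-panel-domains > h2',
        '#main_right > p:nth-child(3)',
        '#main_right > ul:nth-child(4) > li',
    )

    def parse(self, response):
        # path of link texts/URLs followed to reach this page
        req_path = response.meta.get('req_path', [])
        links = [link for css in self.restrict_css
                 for link in self.links(response, css)]
        if not links:
            # assumed leaf condition: no further matching links on this page
            yield {'path': [step['text'] for step in req_path],
                   'url': response.url}
        for link in links:
            yield scrapy.Request(
                url=link.url,
                callback=self.parse,
                meta={'req_path': req_path + [dict(text=link.text, url=link.url)]})

    def links(self, response, restrict_css=None):
        lex = LinkExtractor(
            allow_domains=self.allowed_domains,
            restrict_css=restrict_css)
        return lex.extract_links(response)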
Answer 0 (score: 0):
You can carry the link text along in the request meta and keep appending to it until you reach the page you want, where you merge it all together:
from scrapy import Spider, Request
from scrapy.linkextractors import LinkExtractor


class MySpider(Spider):
    name = 'iana'
    start_urls = ['http://iana.org']
    link_extractors = [LinkExtractor()]

    def parse(self, response):
        path = response.meta.get('path', [])  # retrieve the path we have so far, or set the default
        links = [link for lex in self.link_extractors
                 for link in lex.extract_links(response)]
        for link in links:
            current_path = [link.text]
            yield Request(link.url, self.parse,
                          meta={'path': path + current_path})
        # now when we reach the last page that we want,
        # we yield an item with all gathered path parts
        last_page = not links  # some condition to determine that it's the last page, e.g. no links found
        if last_page:
            item = dict()
            item['path'] = ' > '.join(path)
            # e.g. 'Domain Names > The DNS Root Zone > Root Zone Database'
            yield item
This spider will keep crawling URLs, saving each link's text in meta['path'], and when the chosen condition is met it will yield an item containing all the path values gathered up to that point.
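As a usage sketch, the gathered items could then be written out with Scrapy's built-in feed export; the output file name here is just an example:

$ scrapy crawl iana -o paths.json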