Collect the text of the links that were "clicked"?

Asked: 2017-01-17 22:33:28

Tags: python web-scraping scrapy

I want to collect the text of the links that Scrapy "clicks" while crawling a website.

Consider the following example:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DnsDbSpider(CrawlSpider):
    name = 'dns_db'
    allowed_domains = ['www.iana.org']
    start_urls = ['http://www.iana.org/']

    rules = (
        Rule(LinkExtractor(
            allow_domains='www.iana.org',
            restrict_css=r'#home-panel-domains > h2'),
            callback='parse_item',
            follow=True),
        Rule(LinkExtractor(
            allow_domains='www.iana.org',
            restrict_css=r'#main_right > p:nth-child(3)'),
            callback='parse_item',
            follow=True),
        Rule(LinkExtractor(
            allow_domains='www.iana.org',
            restrict_css=r'#main_right > ul:nth-child(4) > li'),
            callback='parse_item',
            follow=True),
    )


    def parse_item(self, response):
        self.logger.info('## Parsing URL: %s', response.url)
        i = {}
        return i

Scrapy log:

$ scrapy crawl dns_db 2>&1 | grep 'Parsing URL'
2017-01-17 22:14:01 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root
2017-01-17 22:14:02 [dns_db] INFO: ## Parsing URL: http://www.iana.org/domains/root/db

In this case, Scrapy did the following:

  1. Opened "www.iana.org"
    path = []
  2. Clicked the "Domain Names" URL
    path = ['Domain Names']
  3. On the "Domain Names" page, clicked the "The DNS Root Zone" URL.
    path = ['Domain Names', 'The DNS Root Zone']
  4. On the "The DNS Root Zone" page, clicked the "Root Zone Database" URL.
    path = ['Domain Names', 'The DNS Root Zone', 'Root Zone Database']
  5. On the "Root Zone Database" page, I would start scraping the data and creating items. Each resulting item would also have a path attribute (see the sketch after this list):
    path = ['Domain Names', 'The DNS Root Zone', 'Root Zone Database']
  6. Just by looking at this path/list, a human could navigate the website.
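
For illustration, here is a minimal sketch of the kind of item I have in mind (the 'url' and 'path' field names are just placeholders):

    # hypothetical final item -- the field names are placeholders
    item = {
        'url': 'http://www.iana.org/domains/root/db',
        'path': ['Domain Names', 'The DNS Root Zone', 'Root Zone Database'],
    }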

How can I achieve this?

EDIT

Here is a working example:

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    
    
    class DnsDbSpider(scrapy.Spider):
        name = "dns_db"
        allowed_domains = ["www.iana.org"]
        start_urls = ['http://www.iana.org/']
    
        def parse(self, response):
            if 'req_path' not in response.meta:
                response.meta['req_path'] = []
            self.logger.warn('## Request path: %s', response.meta['req_path'])
            restrict_css = (
                r'#home-panel-domains > h2',
                r'#main_right > p:nth-child(3)',
                r'#main_right > ul:nth-child(4) > li'
            )
            links = [link for css in restrict_css for link in self.links(response, css)]
            for link in links:
                #self.logger.info('## Link: %s', link)
                request = scrapy.Request(
                    url=link.url,
                    callback=self.parse)
                request.meta['req_path'] = response.meta['req_path'].copy()
                request.meta['req_path'].append(dict(text=link.text, url=link.url))
                yield request
    
        def links(self, response, restrict_css=None):
            lex = LinkExtractor(
                allow_domains=self.allowed_domains,
                restrict_css=restrict_css)
            return lex.extract_links(response)
    

Command-line output:

    $ scrapy crawl -L WARN dns_db
    2017-02-12 00:13:50 [dns_db] WARNING: ## Request path: []
    2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}]
    2017-02-12 00:13:51 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}]
    2017-02-12 00:13:52 [dns_db] WARNING: ## Request path: [{'text': 'Domain Names', 'url': 'http://www.iana.org/domains'}, {'text': 'The DNS Root Zone', 'url': 'http://www.iana.org/domains/root'}, {'text': 'Root Zone Database', 'url': 'http://www.iana.org/domains/root/db/'}]
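
The example above only logs req_path. As a possible next step (a sketch, assuming that a page from which no further links are extracted is the one to actually scrape), the same parse method could also yield an item carrying the accumulated path:

    # sketch: inside parse(), after the `links` list has been built --
    # assumes pages with no extracted links are the ones holding the data
    if not links:
        yield {
            'url': response.url,
            'req_path': response.meta['req_path'],
        }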
    

1 Answer:

Answer 0 (score: 0)

You can carry the link text along with each request, adding to it as you go, until you reach the page you want and then join it all together:

from scrapy import Spider, Request
from scrapy.linkextractors import LinkExtractor

class MySpider(Spider):
    name = 'iana'
    start_urls = ['http://iana.org']
    link_extractors = [LinkExtractor()]

    def parse(self, response):
        path = response.meta.get('path', [])  # retrieve the path we have so far or set default
        # flatten the links found by every extractor into a single list
        links = [link for lex in self.link_extractors
                 for link in lex.extract_links(response)]
        for link in links:
            current_path = [link.text]
            yield Request(link.url, self.parse,
                          meta={'path': path + current_path})
        # now when we reach the last page that we want,
        # we yield an item with all gathered path parts
        last_page = not links  # some condition to determine that it's the last page, e.g. no links found
        if last_page:
            item = dict()
            item['path'] = ' > '.join(path)
            # e.g. 'Domain Names > The DNS Root Zone > Root Zone Database'
            yield item

This spider will keep following links, saving each link's text in meta['path'] along the way, and once some condition is met it will yield an item containing all the path parts gathered so far.
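
To limit the crawl to just the links from the question, the generic extractor could be swapped for extractors restricted to the question's CSS selectors (a sketch; the selectors are copied from the question and may need adjusting):

from scrapy.linkextractors import LinkExtractor

# sketch: restrict each extractor to one of the question's CSS selectors
link_extractors = [
    LinkExtractor(allow_domains=['www.iana.org'],
                  restrict_css='#home-panel-domains > h2'),
    LinkExtractor(allow_domains=['www.iana.org'],
                  restrict_css='#main_right > p:nth-child(3)'),
    LinkExtractor(allow_domains=['www.iana.org'],
                  restrict_css='#main_right > ul:nth-child(4) > li'),
]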