How to evaluate whether an extracted link is a sub-path

Asked: 2016-06-22 11:36:37

Tags: python path scrapy web-crawler scrapy-spider

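I am using Scrapy with Python 2.7 to crawl some pages. The spider returns response objects, and I inspect the URLs found on each page. I want to restrict the spider so that it only follows URLs that are sub-paths of a location I specify.

For example, I want to tell the spider to follow only links under www.google.com/policies/privacy/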

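The links extracted from the response object follow many different conventions.

E.g.

I can't work out how to do this properly. For now I am just using a simple find call on the string, which seems neither robust nor clever.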

import scrapy

class googleSpider(scrapy.Spider):
    name = "google"
    allowed_domains = ["google.co.uk"]
    start_urls = [
        "http://www.google.co.uk/intl/en/policies/privacy/"
    ]

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            if href.find('/policies/privacy/') != -1:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_dir_contents)

    def parse_dir_contents(self, response):
        pass

1 Answer:

Answer 0 (score: 0)

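You can use LinkExtractor (source code). By default, it normalizes the links it extracts.

Then it is just a matter of building Requests from the links returned by .extract_links(response).

Check this scrapy shell example: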

$ scrapy shell https://www.google.com/policies/privacy/
2016-06-22 18:03:19 [scrapy] INFO: Scrapy 1.1.0 started (bot: scrapybot)
(...edited...)
2016-06-22 18:03:20 [scrapy] DEBUG: Crawled (200) <GET https://www.google.com/policies/privacy/> (referer: None)
(...edited...)

>>> from scrapy.linkextractors import LinkExtractor
>>> for l in LinkExtractor().extract_links(response):
...     print(l.url)
... 
https://www.google.com/
(...edited...)
https://support.google.com/accounts/answer/32046?hl=en
https://www.google.com/trends/
https://www.youtube.com/trendsmap
https://privacy.google.com/?hl=en
https://www.google.com/policies/technologies/location-data/
https://www.google.com/policies/technologies/wallet/
https://www.google.com/policies/technologies/voice/
https://www.google.com/safetycenter/families/start/
https://www.google.com/intl/en/about/
https://www.google.com/intl/en/policies/privacy/
https://www.google.com/intl/en/policies/terms/

>>> for l in LinkExtractor().extract_links(response):
...     if response.url in l.url:
...         print(l.url)
... 
https://www.google.com/policies/privacy/
https://www.google.com/policies/privacy/frameworks/
https://www.google.com/policies/privacy/key-terms/
https://www.google.com/policies/privacy/partners/
https://www.google.com/policies/privacy/archive/
https://www.google.com/policies/privacy/example/more-relevant-search-results.html
https://www.google.com/policies/privacy/example/connect-with-people.html
https://www.google.com/policies/privacy/example/sharing-with-others.html
https://www.google.com/policies/privacy/example/ads-youll-find-most-useful.html
https://www.google.com/policies/privacy/example/the-people-who-matter-most.html
https://www.google.com/policies/privacy/example/credit-card.html
https://www.google.com/policies/privacy/example/collect-information.html
https://www.google.com/policies/privacy/example/view-and-interact-with-our-ads.html
https://www.google.com/policies/privacy/example/device-specific-information.html
https://www.google.com/policies/privacy/example/device-identifiers.html
https://www.google.com/policies/privacy/example/phone-number.html
https://www.google.com/policies/privacy/example/may-collect-and-process-information.html
https://www.google.com/policies/privacy/example/sensors.html
https://www.google.com/policies/privacy/example/wifi-access-points-and-cell-towers.html
https://www.google.com/policies/privacy/example/our-partners.html
https://www.google.com/policies/privacy/example/advertising-services.html
https://www.google.com/policies/privacy/example/linked-with-information-about-visits-to-multiple-sites.html
https://www.google.com/policies/privacy/example/provide-services.html
https://www.google.com/policies/privacy/example/maintain-services.html
https://www.google.com/policies/privacy/example/protect-services.html
https://www.google.com/policies/privacy/example/develop-new-ones.html
https://www.google.com/policies/privacy/example/protect-google-and-our-users.html
https://www.google.com/policies/privacy/example/limit-sharing-or-visibility-settings.html
https://www.google.com/policies/privacy/example/improve-your-user-experience.html
https://www.google.com/policies/privacy/example/combine-personal-information.html
https://www.google.com/policies/privacy/example/to-make-it-easier-to-share.html
https://www.google.com/policies/privacy/example/may-not-function-properly.html
https://www.google.com/policies/privacy/example/sharing.html
https://www.google.com/policies/privacy/example/removing-your-content.html
https://www.google.com/policies/privacy/example/access-to-your-personal-information.html
https://www.google.com/policies/privacy/example/legal-process.html
https://www.google.com/policies/privacy/example/we-may-share.html
https://www.google.com/policies/privacy/example/to-show-trends.html
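
One caveat about the filter in the second loop: `response.url in l.url` is a plain substring test. It works here because LinkExtractor returns absolute URLs, but it would also accept a link that merely embeds the base URL somewhere else, for instance in a query string. As a stricter alternative, here is a minimal sketch that uses only the standard library and compares the host and the path prefix. The helper name `is_subpath` and the `BASE` constant are hypothetical and not part of Scrapy; the sketch also assumes that http and https links should be treated alike.

```python
try:
    from urllib.parse import urljoin, urlparse  # Python 3
except ImportError:
    from urlparse import urljoin, urlparse  # Python 2.7

# Hypothetical base location, taken from the example in the question.
BASE = "https://www.google.com/policies/privacy/"


def is_subpath(base_url, href):
    """Return True if href, resolved against base_url, is at or below base_url's path."""
    base = urlparse(base_url)
    # urljoin turns relative and protocol-relative hrefs into absolute URLs.
    target = urlparse(urljoin(base_url, href))

    # Different host means it cannot be a sub-path (the scheme is deliberately ignored).
    if target.netloc.lower() != base.netloc.lower():
        return False

    # Compare against the base path as a directory, so that
    # "/policies/privacy-old/" does not count as being under "/policies/privacy/".
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    return target.path.startswith(base_path) or target.path + "/" == base_path


if __name__ == "__main__":
    samples = [
        "frameworks/",
        "/policies/privacy/example/sharing.html",
        "../terms/",
        "https://www.google.com/intl/en/policies/privacy/",
        "https://support.google.com/policies/privacy/",
    ]
    for href in samples:
        print(href, is_subpath(BASE, href))
```

Note that, unlike `href.find('/policies/privacy/')`, this check rejects `https://www.google.com/intl/en/policies/privacy/`, because that path does not start with the base path. Whether that is the desired behavior depends on which location you choose as the base.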