Adjusting Scrapy to avoid specific links and return the URL response status

Date: 2018-08-23 09:21:04

Tags: python web-scraping scrapy scrapy-spider

I want to get all the URLs in my domain, so I crawled my site with Scrapy, and it works well (see below). Scrapy returns a list of the pages on my site and, for each page, the set of internal and external URLs found on it, like this:

[screenshot of the crawl output: from_url / to_url pairs]

I would like to modify my crawler to do two things:

  1. Avoid crawling the following URLs (they all map to the same page):

At the moment I do scrape them and then strip the trailing part by splitting the URL on '?replytocom=' and keeping the first element of the resulting array. I imagine I could avoid scraping them in the first place, which would make my crawl faster (see the sketch below).

  2. At the moment my scraper returns:

    • from_url - the URL of the page the link was found on
    • to_url - the URL the link points to (both external and internal, although the spider does not follow external links)

I would also like the scrape to return the status of each to_url, so that I know, for example, whether the links are working (one possible approach is sketched after my spider code below).


How would I best achieve 1 and 2? Thanks in advance for your help.
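For point 1, one idea I am considering (just a sketch, assuming the duplicate links are exactly the ones containing '?replytocom='): give the LinkExtractor in the rule a deny pattern, so that those URLs are never requested at all:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = [
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
            # Skip comment-reply links such as .../?replytocom=123,
            # which all resolve to the same page
            deny=(r'replytocom=',),
        ),
        follow=True,
        callback="parse_items",
    )
]

The same deny pattern would presumably also be needed on the LinkExtractor created inside parse_items, since that is the extractor that produces the to_url values. My current code is below.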

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from data_scraper.items import DataScraperItem


class DataSpider(CrawlSpider):
    # The name of the spider
    name = "urlcrawler"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ["mydomain.co.uk"]

    # The URLs to start with
    start_urls = ["https://mydomain.co.uk/"]

    # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # Method for parsing items
    def parse_items(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        # Now go through all the found links
        for link in links:
            # Check whether the domain of the URL of the link is allowed; so whether it is in one of the allowed domains
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            # If it is allowed, create a new item and add it to the list of found items
            if is_allowed:
                item = DataScraperItem()
                item['url_from'] = response.url
                item['url_to'] = link.url
                items.append(item)
        # Return all the found items
        return items
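For point 2, a rough sketch of one approach (untested; the check_link / on_link_error names and the extra status field are placeholders I made up): instead of building the items directly in parse_items, yield a request for each extracted link and build the item from the response, so that response.status is available. Requests to external domains are let past the offsite filter with dont_filter=True, and non-2xx responses are kept via the handle_httpstatus_all meta key. These methods would replace parse_items inside the DataSpider class above:

    # (inside class DataSpider(CrawlSpider), imports as in my spider file)
    def parse_items(self, response):
        # Extract links as before, but skip the replytocom duplicates up front
        links = LinkExtractor(canonicalize=True, unique=True,
                              deny=(r'replytocom=',)).extract_links(response)
        for link in links:
            yield scrapy.Request(
                link.url,
                method='HEAD',                      # headers are enough for a status check (assuming the servers handle HEAD)
                callback=self.check_link,
                errback=self.on_link_error,
                dont_filter=True,                   # let external URLs past the offsite filter; they are not crawled further
                meta={
                    'url_from': response.url,
                    'handle_httpstatus_all': True,  # keep 4xx/5xx responses instead of dropping them
                },
            )

    def check_link(self, response):
        # Build the item once the destination URL has answered
        item = DataScraperItem()
        item['url_from'] = response.meta['url_from']
        item['url_to'] = response.url
        item['status'] = response.status            # e.g. 200, 301, 404
        yield item

    def on_link_error(self, failure):
        # Destination URLs that failed at the network level (DNS error, timeout, ...)
        item = DataScraperItem()
        item['url_from'] = failure.request.meta['url_from']
        item['url_to'] = failure.request.url
        item['status'] = 'error'
        yield item

My current items.py is below; it would also need the extra status field, sketched after it.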

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DataScraperItem(scrapy.Item):
    # The source URL
    url_from = scrapy.Field()
    # The destination URL
    url_to = scrapy.Field()
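The status field used in the sketch above does not exist in my current item, so items.py would need an extra field along these lines:

import scrapy


class DataScraperItem(scrapy.Item):
    # The source URL
    url_from = scrapy.Field()
    # The destination URL
    url_to = scrapy.Field()
    # HTTP status (or 'error') of the destination URL - hypothetical new field
    status = scrapy.Field()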

0 Answers:

There are no answers yet.