Adjusting Scrapy to avoid specific links and return the URL response status

Date: 2018-08-23 09:21:04

Tags: python web-scraping scrapy scrapy-spider

I want to get all the URLs in my domain, so I crawled my site with Scrapy, and it works well (see below). Scrapy returns a list of the pages on my site and, for each page, the set of internal and external URLs found on it, like this:

[screenshot of the crawl output: from_url / to_url pairs]

I would like to modify my crawler to do two things:

  1. Avoid crawling the following URLs (they all map to the same page):

At the moment I do scrape them and then strip the trailing part by splitting the URL on '?replytocom=' and keeping the first element of the resulting array. I imagine I could avoid scraping them in the first place, which would make my crawl faster (see the sketch below).

  2. At the moment my scraper returns:

    • from_url - the URL of the page the link was found on
    • to_url - the URL the link points to (both external and internal, although the spider does not follow external links)

I would also like the scrape to return the status of each to_url, so that I know, for example, whether the links are working (one possible approach is sketched after my spider code below).


How would I best achieve 1 and 2? Thanks in advance for your help.
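For point 1, one idea I am considering (just a sketch, assuming the duplicate links are exactly the ones containing '?replytocom='): give the LinkExtractor in the rule a deny pattern, so that those URLs are never requested at all:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = [
    Rule(
        LinkExtractor(
            canonicalize=True,
            unique=True,
            # Skip comment-reply links such as .../?replytocom=123,
            # which all resolve to the same page
            deny=(r'replytocom=',),
        ),
        follow=True,
        callback="parse_items",
    )
]

The same deny pattern would presumably also be needed on the LinkExtractor created inside parse_items, since that is the extractor that produces the to_url values. My current code is below.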

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider
from data_scraper.items import DataScraperItem


class DataSpider(CrawlSpider):
    # The name of the spider
    name = "urlcrawler"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ["mydomain.co.uk"]

    # The URLs to start with
    start_urls = ["https://mydomain.co.uk/"]

    # This spider has one rule: extract all (unique and canonicalized) links, follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # Method for parsing items
    def parse_items(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        # Now go through all the found links
        for link in links:
            # Check whether the domain of the URL of the link is allowed; so whether it is in one of the allowed domains
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                if allowed_domain in link.url:
                    is_allowed = True
            # If it is allowed, create a new item and add it to the list of found items
            if is_allowed:
                item = DataScraperItem()
                item['url_from'] = response.url
                item['url_to'] = link.url
                items.append(item)
        # Return all the found items
        return items
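For point 2, a rough sketch of one approach (untested; the check_link / on_link_error names and the extra status field are placeholders I made up): instead of building the items directly in parse_items, yield a request for each extracted link and build the item from the response, so that response.status is available. Requests to external domains are let past the offsite filter with dont_filter=True, and non-2xx responses are kept via the handle_httpstatus_all meta key. These methods would replace parse_items inside the DataSpider class above:

    # (inside class DataSpider(CrawlSpider), imports as in my spider file)
    def parse_items(self, response):
        # Extract links as before, but skip the replytocom duplicates up front
        links = LinkExtractor(canonicalize=True, unique=True,
                              deny=(r'replytocom=',)).extract_links(response)
        for link in links:
            yield scrapy.Request(
                link.url,
                method='HEAD',                      # headers are enough for a status check (assuming the servers handle HEAD)
                callback=self.check_link,
                errback=self.on_link_error,
                dont_filter=True,                   # let external URLs past the offsite filter; they are not crawled further
                meta={
                    'url_from': response.url,
                    'handle_httpstatus_all': True,  # keep 4xx/5xx responses instead of dropping them
                },
            )

    def check_link(self, response):
        # Build the item once the destination URL has answered
        item = DataScraperItem()
        item['url_from'] = response.meta['url_from']
        item['url_to'] = response.url
        item['status'] = response.status            # e.g. 200, 301, 404
        yield item

    def on_link_error(self, failure):
        # Destination URLs that failed at the network level (DNS error, timeout, ...)
        item = DataScraperItem()
        item['url_from'] = failure.request.meta['url_from']
        item['url_to'] = failure.request.url
        item['status'] = 'error'
        yield item

My current items.py is below; it would also need the extra status field, sketched after it.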

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DataScraperItem(scrapy.Item):
    # The source URL
    url_from = scrapy.Field()
    # The destination URL
    url_to = scrapy.Field()
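The status field used in the sketch above does not exist in my current item, so items.py would need an extra field along these lines:

import scrapy


class DataScraperItem(scrapy.Item):
    # The source URL
    url_from = scrapy.Field()
    # The destination URL
    url_to = scrapy.Field()
    # HTTP status (or 'error') of the destination URL - hypothetical new field
    status = scrapy.Field()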

0 Answers:

There are no answers yet.