I want to get all of the URLs in my domain, so I crawled my website with Scrapy, and it works well (see below). Scrapy returns the list of pages on my site and, for each page, the set of internal and external URLs found on that page, like this:
I would like to modify my crawler to do two things:
1. Avoid crawling the following URLs (they all map to the same page):

At the moment I scrape them anyway and then drop the trailing part by splitting the URL on '?replytocom=' and keeping the first element of the resulting array. I figure I could avoid scraping them in the first place, which should make the crawl faster.
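For reference, this is the cleanup I do today, together with my guess at how to skip those links at extraction time instead. The deny pattern is an assumption on my part; as far as I can tell, LinkExtractor matches deny as a regex against the absolute URL:

from scrapy.linkextractors import LinkExtractor

# What I do now, after the crawl: split on '?replytocom=' and keep the first part
def strip_replytocom(url):
    return url.split('?replytocom=')[0]

# What I would like instead: never extract these links at all.
# Assumption: deny takes a regex (or a list of regexes) that is matched
# against the absolute URL of each candidate link.
link_extractor = LinkExtractor(
    canonicalize=True,
    unique=True,
    deny=r'\?replytocom='
)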
At the moment my scraper returns:
2. I would also like the scrape to report the status of each to_url, so that I know, for example, whether all of them are working.
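My rough idea for 2 is sketched below (untested): instead of building the item directly in parse_items, yield a Request for each extracted link and record response.status in a second callback. LinkStatusItem is a hypothetical item class that extends my DataScraperItem with a status field:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

# Hypothetical item with an extra field for the HTTP status
class LinkStatusItem(scrapy.Item):
    url_from = scrapy.Field()
    url_to = scrapy.Field()
    url_status = scrapy.Field()


class DataSpider(CrawlSpider):
    name = "urlcrawler"
    allowed_domains = ["mydomain.co.uk"]
    start_urls = ["https://mydomain.co.uk/"]
    rules = [Rule(LinkExtractor(canonicalize=True, unique=True),
                  follow=True, callback="parse_items")]

    def parse_items(self, response):
        # Request every extracted link so that its HTTP status can be recorded
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        for link in links:
            yield scrapy.Request(
                link.url,
                callback=self.check_status,
                meta={
                    # remember the page the link was found on
                    'url_from': response.url,
                    # let non-2xx responses (e.g. 404) reach the callback too
                    'handle_httpstatus_all': True,
                },
                dont_filter=True,
            )

    def check_status(self, response):
        # Emit one item per link, including the HTTP status of the target
        item = LinkStatusItem()
        item['url_from'] = response.meta['url_from']
        item['url_to'] = response.url
        item['url_status'] = response.status
        yield item

One thing I am unsure about: with allowed_domains set, Scrapy's offsite middleware filters requests to external domains by default, so I suspect this would only report statuses for internal to_urls.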
What is the best way to achieve 1 and 2? Thanks in advance for your help. My spider and item definition are below.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

from data_scraper.items import DataScraperItem


class DataSpider(CrawlSpider):
    # The name of the spider
    name = "urlcrawler"

    # The domains that are allowed (links to other domains are skipped)
    allowed_domains = ["mydomain.co.uk"]

    # The URLs to start with
    start_urls = ["https://mydomain.co.uk/"]

    # This spider has one rule: extract all (unique and canonicalized) links,
    # follow them and parse them using the parse_items method
    rules = [
        Rule(
            LinkExtractor(
                canonicalize=True,
                unique=True
            ),
            follow=True,
            callback="parse_items"
        )
    ]

    # Method which starts the requests by visiting all URLs specified in start_urls
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    # Method for parsing items
    def parse_items(self, response):
        # The list of items that are found on the particular page
        items = []
        # Only extract canonicalized and unique links (with respect to the current page)
        links = LinkExtractor(canonicalize=True, unique=True).extract_links(response)
        # Now go through all the found links
        for link in links:
            # The domain check is commented out so that external links are
            # reported as well; with the check enabled, only links within
            # allowed_domains would produce items
            is_allowed = False
            for allowed_domain in self.allowed_domains:
                # if allowed_domain in link.url:
                is_allowed = True
            # If it is allowed, create a new item and add it to the list of found items
            if is_allowed:
                item = DataScraperItem()
                item['url_from'] = response.url
                item['url_to'] = link.url
                items.append(item)
        # Return all the found items
        return items
And this is my items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class DataScraperItem(scrapy.Item):
    # The source URL
    url_from = scrapy.Field()
    # The destination URL
    url_to = scrapy.Field()
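For completeness, I run the crawl and export the collected items like this (the output file name is just an example):

scrapy crawl urlcrawler -o links.csv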