During a broad crawl I only want to scrape the first 10-20 internal links of each site so that I don't hammer the web servers, but there are far too many domains to list them in allowed_domains. I'm asking here because the Scrapy documentation doesn't cover this and I couldn't find an answer through Google.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.item import Item, Field

class DomainLinks(Item):
    links = Field()

class ScapyProject(CrawlSpider):
    name = 'scapyproject'

    #allowed_domains = []
    start_urls = ['big domains list loaded from database']

    rules = (Rule(LxmlLinkExtractor(allow=()), callback='parse_links', follow=True),)

    def parse_start_url(self, response):
        return self.parse_links(response)

    def parse_links(self, response):
        item = DomainLinks()
        item['links'] = []

        # Crude domain extraction: drop the scheme and common www prefixes,
        # then take everything up to the first slash
        domain = response.url.replace("http://", "").replace("https://", "").replace("www.", "").replace("ww2.", "").split("/")[0]

        links = LxmlLinkExtractor(allow=(), deny=()).extract_links(response)

        # Keep only links that point back to the same domain
        links = [link for link in links if domain in link.url]

        # Filter duplicates and append to the item
        for link in links:
            if link.url not in item['links']:
                item['links'].append(link.url)

        return item
Is the following list comprehension the best way to filter links without using an allowed_domains list or the LxmlLinkExtractor allow filter? Both of those appear to use regular expressions, which would hurt performance and limit the size of the allowed domains list if every scraped link has to be regex-matched against every domain in the list.
links = [link for link in links if domain in link.url]
The other problem I'm struggling with is how to make the spider follow only internal links without using an allowed_domains list. A custom middleware?
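For what it's worth, this is roughly the kind of custom middleware I had in mind, as a rough, untested sketch (InternalLinksMiddleware and the settings path are made-up names): a spider middleware that drops any request whose host differs from the host of the page it was extracted from.

from urllib.parse import urlparse
from scrapy import Request

class InternalLinksMiddleware(object):
    # Drop any request whose host differs from the host of the response
    # it was extracted from; items and same-host requests pass through.
    def process_spider_output(self, response, result, spider):
        origin = urlparse(response.url).netloc
        for request_or_item in result:
            if isinstance(request_or_item, Request) and urlparse(request_or_item.url).netloc != origin:
                continue  # external link, skip it
            yield request_or_item

It would be enabled with something like SPIDER_MIDDLEWARES = {'myproject.middlewares.InternalLinksMiddleware': 543} in settings.py. This sketch doesn't normalize www prefixes, so www.example.com and example.com would count as different hosts.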
Thanks
Answer 0 (score: 1)
Yes, your list comprehension is fine, and is probably the best way to solve this.
links = [link for link in links if domain in link.url]
It has the advantage of keeping the domain check a simple substring test instead of regex-matching every extracted link against a long allowed_domains list.
Apart from that, I'd suggest using urllib to extract the domain: Get domain name from URL
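A minimal sketch of that suggestion, assuming Python 3's urllib.parse (on Python 2 the same function lives in the urlparse module):

from urllib.parse import urlparse

domain = urlparse(response.url).netloc
# keep only links whose host matches the page's host exactly;
# note this treats www.example.com and example.com as different hosts
links = [link for link in links if urlparse(link.url).netloc == domain]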
If you only want to crawl internal links, you can do it like this: rename parse_links to parse and create new requests manually so that only internal links are followed:
def parse(self, response):
    # your code ... removed for brevity ...

    links = [link for link in links if domain in link.url]

    # Filter duplicates and append to the item
    for link in links:
        if link.url not in item['links']:
            item['links'].append(link.url)
            # follow the internal link
            yield Request(link.url)

    yield item
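Request here is scrapy.Request, so it needs to be imported. Note also that overriding parse on a CrawlSpider bypasses the rule machinery, so once links are followed manually like this the spider may as well inherit from scrapy.Spider and drop the rules.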