Question

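I have to crawl 5-6 domains. I want to write a crawler in which offsite requests containing certain substrings, for example the set [aaa, bbb, ccc], are still handled: if an offsite URL contains a substring from that set, it should be processed rather than filtered out. Should I write a custom middleware, or can I just use regular expressions in the allowed domains?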

Answer 1

The offsite middleware already uses regular expressions by default, but this is not exposed. It compiles the domains you provide into a regex, but the domain names are escaped first, so putting regex syntax in allowed_domains will not work.
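The effect of that escaping can be seen in a small standalone sketch. It builds the pattern with the same expression as the get_host_regex() code quoted in this answer; the domain values and hostnames are made up for illustration.

```python
import re

# Hypothetical spider setting: the first entry is an attempt at regex syntax.
allowed_domains = [r".*aaa.*", "example.com"]

# Same construction as in get_host_regex(): every entry goes through re.escape().
pattern = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains)
host_regex = re.compile(pattern)

# A plain domain and its subdomains still match.
matches_subdomain = host_regex.search("www.example.com") is not None

# The regex-like entry is now a literal string, so this host is treated as offsite.
matches_substring_host = host_regex.search("cdn-aaa.net") is not None

print(matches_subdomain, matches_substring_host)
```

The second check comes out false because the escaped entry only matches a host that is literally named ".*aaa.*".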

What you can do instead is extend that middleware and override its get_host_regex() method to implement your own offsite policy.

This is the original code of get_host_regex() in scrapy.spidermiddlewares.offsite.OffsiteMiddleware, which you can override to return your own regex:

import re

def get_host_regex(self, spider):
    """Override this method to implement a different offsite policy"""
    allowed_domains = getattr(spider, 'allowed_domains', None)
    if not allowed_domains:
        return re.compile('')  # allow all by default
    # Each domain is escaped, so regex syntax in allowed_domains is taken literally.
    regex = r'^(.*\.)?(%s)$' % '|'.join(re.escape(d) for d in allowed_domains if d is not None)
    return re.compile(regex)
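As a rough illustration of what such an override could return, here is a minimal sketch written as plain Python so it runs without Scrapy. The names build_host_regex, should_follow, ALLOWED_DOMAINS and OFFSITE_SUBSTRINGS are hypothetical, and the should_follow helper here only imitates the middleware's check, on the assumption that the compiled regex is searched against the request's hostname rather than the full URL.

```python
import re
from urllib.parse import urlparse

# Hypothetical values for illustration; not taken from the original post.
ALLOWED_DOMAINS = ["example.com", "example.org"]
OFFSITE_SUBSTRINGS = ["aaa", "bbb", "ccc"]


def build_host_regex(allowed_domains, substrings):
    """Accept allowed domains (and subdomains) or any host containing a substring."""
    alternatives = []
    domains = [re.escape(d) for d in allowed_domains if d]
    if domains:
        # Same shape as the original policy: exact domain or any subdomain of it.
        alternatives.append(r"^(.*\.)?(%s)$" % "|".join(domains))
    parts = [re.escape(s) for s in substrings if s]
    if parts:
        # Unanchored alternative: the substring may appear anywhere in the host.
        alternatives.append("(%s)" % "|".join(parts))
    if not alternatives:
        return re.compile("")  # nothing configured: allow all, like the original
    return re.compile("|".join(alternatives))


def should_follow(url, host_regex):
    """Imitate the offsite check by searching the regex against the hostname only."""
    host = urlparse(url).hostname or ""
    return bool(host_regex.search(host))


host_regex = build_host_regex(ALLOWED_DOMAINS, OFFSITE_SUBSTRINGS)

for url in [
    "https://www.example.com/page",  # allowed domain
    "https://cdn-aaa.other.net/x",   # offsite host containing "aaa"
    "https://other.net/aaa",         # "aaa" only in the path, not in the host
]:
    print(url, should_follow(url, host_regex))
```

In a project, a subclass of OffsiteMiddleware could return build_host_regex(...) from its get_host_regex(self, spider) override and be enabled in place of the built-in middleware. The exact module path and setting depend on the Scrapy version, so treat that wiring as something to verify. If the substrings can appear in the path rather than the host, a host regex alone will not catch them and the check itself would need to be customized.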

Scrapy - handling offsite requests based on regular expressions
