Set allowed_domains per start_url in Scrapy?

Asked: 2015-01-22 10:01:31

Tags: python scrapy web-crawler

Is there a way to set allowed_domains for each start_url individually? For each URL in start_urls, I want to restrict crawling to that URL's domain. Once a site has been crawled, I need to remove its domain from allowed_domains. I suppose one way would be to dynamically add/remove domains to allowed_domains?

Related question: Crawl multiple domains with Scrapy without criss-cross

1 Answer:

Answer 0 (score: 1)

You can try something like this: a spider middleware that, for every response, checks whether each Request the spider outputs stays on the same domain as that response (warning: untested):

from scrapy.http import Request
from scrapy.utils.httpobj import urlparse_cached

class CrissCrossOffsiteMiddleware(object):

    def process_spider_output(self, response, result, spider):
        # Hostname of the page that produced these requests.
        # Note: urlparse_cached expects a Request/Response object,
        # not a URL string, so pass the response itself.
        domainr = urlparse_cached(response).hostname
        for x in result:
            if isinstance(x, Request):
                if x.dont_filter:
                    # Requests marked dont_filter bypass the domain check.
                    yield x
                else:
                    domaino = urlparse_cached(x).hostname
                    if domaino == domainr:
                        # Only follow links that stay on the same domain.
                        yield x
            else:
                # Pass items and other spider output through untouched.
                yield x
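The core of the middleware is just a hostname comparison. Below is a minimal standalone sketch of that same-domain check using only the standard library (urllib.parse in place of scrapy.utils.httpobj.urlparse_cached), so it can be run and tested without Scrapy; the function names same_domain and filter_requests are hypothetical, chosen here for illustration:

```python
from urllib.parse import urlparse

def same_domain(response_url, request_url):
    # True if the outgoing request stays on the response's hostname.
    return urlparse(response_url).hostname == urlparse(request_url).hostname

def filter_requests(response_url, request_urls):
    # Keep only outgoing URLs that share the response's hostname,
    # mirroring what the middleware above yields for Request objects.
    return [u for u in request_urls if same_domain(response_url, u)]
```

To use the actual middleware in a project, you would register it under the SPIDER_MIDDLEWARES setting in settings.py (path and priority are assumptions for the sketch), likely in place of Scrapy's built-in OffsiteMiddleware.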