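Is there a way to set allowed_domains for each start_url? For each URL in start_urls, I want to restrict the crawl to that URL's domain. Once a site has been crawled, I need to remove its domain from allowed_domains. I suppose one approach would be to dynamically add URLs to, and remove them from, allowed_domains?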
Answer 0 (score: 1)
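You could try something like this: a spider middleware that checks whether each Request the spider outputs for a response is on the same domain as that response (warning: untested):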
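    from scrapy.http import Request
    from scrapy.utils.httpobj import urlparse_cached


    class CrissCrossOffsiteMiddleware(object):
        """Drop requests that leave the host of the response that produced them."""

        def process_spider_output(self, response, result, spider):
            # urlparse_cached expects a Request or Response object, not a URL string
            domainr = urlparse_cached(response).hostname
            for x in result:
                if isinstance(x, Request):
                    if x.dont_filter:
                        # Requests flagged dont_filter bypass the host check
                        yield x
                    else:
                        domaino = urlparse_cached(x).hostname
                        if domaino == domainr:
                            yield x
                else:
                    # Items and other non-Request output pass through unchanged
                    yield x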
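
Note that the middleware above compares hostnames exactly, so a link from example.com to blog.example.com would be dropped. If that is too strict, the comparison could be relaxed to accept subdomains as well. Below is a minimal standalone sketch of such a check, using only the standard library; `same_site` and its `allow_subdomains` parameter are hypothetical names introduced here for illustration and are not part of Scrapy:

```python
from urllib.parse import urlparse


def same_site(response_url, request_url, allow_subdomains=True):
    """Return True if request_url stays on the host of response_url.

    Hypothetical helper for illustration; not part of Scrapy.
    """
    source = urlparse(response_url).hostname or ""
    target = urlparse(request_url).hostname or ""
    if not source or not target:
        # Relative or malformed URLs have no hostname to compare
        return False
    if target == source:
        return True
    # Optionally accept subdomains of the source host,
    # e.g. blog.example.com when the response came from example.com
    return allow_subdomains and target.endswith("." + source)


print(same_site("http://example.com/a", "http://example.com/b"))
print(same_site("http://example.com/a", "http://blog.example.com/b"))
print(same_site("http://example.com/a", "http://other.org/"))
```

Inside the middleware, the `domaino == domainr` test could be swapped for a check along these lines. Either way, the middleware only takes effect once it is registered in the project's SPIDER_MIDDLEWARES setting.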