I am trying to deny localized URLs, like so:
rules = (
    Rule(LinkExtractor(deny=(r'\/es\/')), follow = True)
)
However, this fails. I tried the following regex as well, but no luck:
rules = (
    Rule(LinkExtractor(deny=(r'\/es\/*.*')), follow = True)
)
Basically, I am only interested in the English version of this resource, not the Spanish version, i.e. the one with /es/ in the URL.
How do I make sure I don't crawl the Spanish URLs?
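For context, here is a minimal sketch of the CrawlSpider these rules would sit in; the spider name, allowed domain, and start URL below are assumptions for illustration, not part of the original post:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EnglishOnlySpider(CrawlSpider):
    # Assumed placeholders: name, domain and start URL are not from the question.
    name = "english_only"
    allowed_domains = ["example.com"]
    start_urls = ["https://example.com/en/"]

    # rules must be a tuple/list of Rule objects, hence the trailing comma.
    rules = (
        Rule(LinkExtractor(deny=(r'/es/',)), follow=True),
    )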
Answer 0 (score: 0)
Define your downloader middleware in your spider, like this:
import scrapy
from scrapy import Request


class MySpider(scrapy.Spider):
    name = "my_spider"

    # Enable the custom downloader middleware only for this spider.
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'project_root_path.MyMiddlewaresFile.MyMiddleware': 300,
        }
    }

    def start_requests(self):
        # The original answer left the URL out; put your start URL(s) here.
        yield Request()
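Note that the DOWNLOADER_MIDDLEWARES key must be the full dotted import path to the middleware class, so 'project_root_path.MyMiddlewaresFile.MyMiddleware' is a placeholder that has to match your actual project layout; 300 is simply the priority the answer chose, which places the middleware among Scrapy's built-in downloader middlewares.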
And in MyMiddlewaresFile.py:
from scrapy.exceptions import IgnoreRequest


class MyMiddleware(object):
    def process_request(self, request, spider):
        if "/en/" in request.url:
            # English URL: returning None lets Scrapy continue
            # processing and download the page as usual.
            return None
        # Any other locale (e.g. /es/): drop the request before download.
        raise IgnoreRequest()
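With this middleware enabled through custom_settings, requests for /en/ URLs pass through untouched, because returning None from process_request tells Scrapy to continue the normal download chain, while any other request is discarded before it is downloaded, since raising IgnoreRequest is how a downloader middleware drops a request. That is what keeps the Spanish /es/ pages out of the crawl.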