I'm crawling a site whose URLs can carry a ?locale=en or ?locale=jp parameter. I'm only interested in crawling pages whose URL does not specify a locale.
This is what I have at the moment:
# More specific ones at the top please
# In general, deny all locale-specified links
rules = (
    # Matches looks, e.g.
    # http://lookbook.nu/look/4273137-Galla-Spectrum-Yellow
    Rule(SgmlLinkExtractor(allow=('/look/\d+',), deny=('\?locale=',)),
         callback='parse_look'),
    # Matches all looks pages under a user overview
    Rule(SgmlLinkExtractor(allow=('/user/\d+[^/]+/looks/?$',), deny=('\?locale=',)),
         callback='parse_model_looks'),
    Rule(SgmlLinkExtractor(allow=('/user/\d+[^/]+/looks\?page=\d+$',), deny=('\?locale=',)),
         callback='parse_model_looks'),
    # Matches all user overview pages
    Rule(SgmlLinkExtractor(allow=('/user/\d+[^/]*/?$',), deny=('\?locale=',)),
         callback='parse_model_overview'),
)
I'm repeating the deny everywhere. Is there a better way to do this?
I tried writing one general rule that denies everything matching \?locale=, but that didn't work.
Answer 0 (score: 2)
You could build one complex "allow" regular expression, but writing regexes is usually painful. You can also use a process_links method, as described here: https://scrapy.readthedocs.org/en/latest/topics/spiders.html?highlight=process_links
That opens up the possibility of calling a URL parser and inspecting the query parameters:
Rule(SgmlLinkExtractor(allow=('/look/\d+',)),
     process_links='process_links',
     callback='parse_look')

def process_links(self, links):
    # Keep only the links that pass the locale check below.
    return [link for link in links if self.valid_links(link)]

def valid_links(self, link):
    import urlparse
    urlp = urlparse.urlparse(link.url)
    querydict = urlparse.parse_qs(urlp.query)
    return "locale" not in querydict
This is a safer technique for checking parameters than pattern matching on the raw URL.
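The same check can be sketched in Python 3, where the urlparse module moved to urllib.parse. This is a standalone illustration of the filtering logic on plain URL strings (the example URLs are made up); inside a spider you would apply it to link.url as above.

```python
from urllib.parse import urlparse, parse_qs

def has_locale(url):
    # parse_qs returns a dict of query parameters, so the check
    # matches ?locale=en, ?page=2&locale=jp, etc., but not other params.
    return "locale" in parse_qs(urlparse(url).query)

# Illustrative URLs only; any link with a locale parameter is dropped.
urls = [
    "http://lookbook.nu/look/4273137-Galla-Spectrum-Yellow",
    "http://lookbook.nu/look/4273137?locale=en",
    "http://lookbook.nu/user/123-name/looks?page=2",
]
kept = [u for u in urls if not has_locale(u)]
print(kept)
```

Because the query string is actually parsed, this won't be fooled by, say, a path segment that merely contains the text "locale=".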