Question

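For some reason, a certain mobile URL is being crawled, and the resulting URL throws an error when it is scraped. I would like Scrapy to simply ignore the URL rather than call the parse method or anything in it.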

class MySpider(scrapy.Spider):

    # name, allowed_domains etc
    rules = (Rule(LxmlLinkExtractor(deny=r'/m/.+')),)  # deny http://example.com/m/anything-here.html

But this does not work; such links are still being scraped.

Answer 1

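According to the docs:

deny (a regular expression (or list of)) - a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted).

So /m/.+ does not match an absolute URL such as http://example.com/m/anything-here.html. For the same reason you need the .+ at the end, you need at least .* at the start: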

>>> print(re.match(r'/m/.+', 'http://example.com/m/anything-here.html'))
None
>>> print(re.match(r'.*/m/.+', 'http://example.com/m/anything-here.html'))
<_sre.SRE_Match object; span=(0, 39), match='http://example.com/m/anything-here.html'>
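
To try candidate patterns against several URLs before putting them in deny, a small check along these lines may help. It is only a sketch that repeats the re.match test from the session above; is_denied is a hypothetical helper name, not part of Scrapy, and http://example.com/page.html is an assumed second URL added for contrast.

```python
import re

# Hypothetical helper (not a Scrapy API): repeats the re.match check shown above.
def is_denied(url, pattern):
    """Return True if `pattern` matches `url` starting at the first character."""
    return re.match(pattern, url) is not None

# The first URL comes from the question; the second is an assumed non-mobile URL.
URLS = [
    "http://example.com/m/anything-here.html",
    "http://example.com/page.html",
]

for url in URLS:
    # Compare the original pattern with the one prefixed by .*
    print(url, is_denied(url, r"/m/.+"), is_denied(url, r".*/m/.+"))
```

Only the pattern with the leading .* should report the /m/ URL as denied, and neither pattern should match the second URL.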
