```python
Rule(
    LinkExtractor(
        allow=rule.get("allow", None),
        restrict_xpaths=rule.get("restrict_xpaths", ""),
        deny=('guba', 'f10', 'data', r'fund.*?\.eastmoney\.com/\d+\.html',
              'quote', r'.*so\.eastmoney.*', 'life', '/gonggao/'),
    ),
    callback=rule.get("callback", ""),
    follow=rule.get('follow', True),
)
```
Rule settings ↑

Crawl log:
```
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of166401.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of164206.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of161823.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
```
**My `deny` settings didn't take effect.** Can anyone help?
Answer 0 (score: 0)
From the documentation:
> deny (a regular expression or list of regular expressions) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.
https://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
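The documented precedence can be illustrated with a small standalone sketch. Note this is not Scrapy's actual implementation, just a plain-`re` model of the semantics described above (using `re.match`, as in the demonstration further below):

```python
import re

def link_allowed(url, allow=(), deny=()):
    """Model of the documented allow/deny semantics:
    deny has precedence over allow; an empty deny excludes nothing."""
    if any(re.match(pattern, url) for pattern in deny):
        return False  # a matching deny pattern always excludes the link
    if not allow:
        return True   # no allow patterns means everything else passes
    return any(re.match(pattern, url) for pattern in allow)

# A deny pattern with wildcards matches the full URL, so the link is excluded
print(link_allowed("http://guba.eastmoney.com/x.html", deny=(".*guba.*",)))
```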
This means that `deny` is a list of regular expressions that the URLs should match. The patterns defined in your code do not match the URLs you are crawling; if you want a partial match, you need to add wildcards at the beginning and end:
```python
$ ptpython
>>> import re
>>> url = "http://guba.eastmoney.com/list,of161823.html"
>>> re.match('guba', url)
>>> re.match('.+guba.+', url)
<re.Match object; span=(0, 44), match='http://guba.eastmoney.com/list,of161823.html'>
```
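Applied to the rule above, each deny pattern would need leading and trailing wildcards so it can match the full absolute URL. The list below is an illustrative adaptation of the original patterns, verified only against the sample URL from the log, not the live site:

```python
import re

# Original deny patterns wrapped in wildcards so they match full URLs
deny_patterns = (
    r'.*guba.*', r'.*f10.*', r'.*data.*',
    r'.*fund.*?\.eastmoney\.com/\d+\.html.*',
    r'.*quote.*', r'.*so\.eastmoney.*', r'.*life.*', r'.*/gonggao/.*',
)

url = "http://guba.eastmoney.com/list,of161823.html"
# At least one deny pattern now matches, so this link would be excluded
print(any(re.match(p, url) for p in deny_patterns))
```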