Scrapy CrawlSpider "deny" setting has no effect

Time: 2019-06-27 02:44:27

Tags: python scrapy

```python
Rule(
    LinkExtractor(
        allow=rule.get("allow", None),
        restrict_xpaths=rule.get("restrict_xpaths", ""),
        deny=(
            'guba', 'f10', 'data',
            r'fund.*?\.eastmoney\.com/\d+\.html',
            'quote', r'.*so\.eastmoney.*',
            'life', '/gonggao/',
        ),
    ),
    callback=rule.get("callback", ""),
    follow=rule.get('follow', True),
)
```

Rule settings above.

Run log:

```
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of166401.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of164206.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of161823.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
```

**My settings didn't work.** Any help?

1 answer:

Answer 0 (score: 0)

From the documentation:

> deny (a regular expression, or list of regular expressions) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.

https://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
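
To see the parameter in isolation, here is a minimal, self-contained sketch of `deny` filtering extracted links. The sample HTML is invented for the example; `LinkExtractor` and `HtmlResponse` are Scrapy's own classes:

```python
from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

# A fake page containing one link to keep and one to exclude.
html = b"""
<html><body>
  <a href="http://guba.eastmoney.com/list,of166401.html">forum</a>
  <a href="http://fund.eastmoney.com/LOF_jzzzl.html">fund list</a>
</body></html>
"""
response = HtmlResponse(
    url="http://fund.eastmoney.com/LOF_jzzzl.html",
    body=html,
    encoding="utf-8",
)

# Wildcards on both sides let the pattern match anywhere in the URL.
extractor = LinkExtractor(deny=(r".*guba.*",))
for link in extractor.extract_links(response):
    print(link.url)  # the guba.eastmoney.com link is filtered out
```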

This means that `deny` takes a regular expression (or a list of them) that should match the URLs you want excluded. The patterns defined in your code do not match the URLs you are crawling; if you want a partial match, you need to add wildcards at the beginning and the end:

```
$ ptpython
>>> import re
>>> url = "http://guba.eastmoney.com/list,of161823.html"
>>> re.match('guba', url)
>>> re.match('.+guba.+', url)
<re.Match object; span=(0, 44), match='http://guba.eastmoney.com/list,of161823.html'>
```
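
Applied to the Rule from your question, that means wrapping each fragment in wildcards. A sketch, keeping your `rule.get(...)` lookups unchanged (patterns that already have wildcards, like `.*so\.eastmoney.*`, are left as they were):

```python
Rule(
    LinkExtractor(
        allow=rule.get("allow", None),
        restrict_xpaths=rule.get("restrict_xpaths", ""),
        deny=(
            r'.*guba.*', r'.*f10.*', r'.*data.*',
            r'.*fund.*?\.eastmoney\.com/\d+\.html.*',
            r'.*quote.*', r'.*so\.eastmoney.*',
            r'.*life.*', r'.*/gonggao/.*',
        ),
    ),
    callback=rule.get("callback", ""),
    follow=rule.get('follow', True),
)
```

With these patterns, the guba URLs shown in your log should be excluded.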