```python
Rule(
    LinkExtractor(
        allow=rule.get("allow", None),
        restrict_xpaths=rule.get("restrict_xpaths", ""),
        deny=('guba', 'f10', 'data', r'fund.*?\.eastmoney\.com/\d+\.html',
              'quote', r'.*so\.eastmoney.*', 'life', '/gonggao/'),
    ),
    callback=rule.get("callback", ""),
    follow=rule.get('follow', True),
)
```
Rule settings ↑

Crawl log:
```
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of166401.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of164206.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
2019-06-27 10:33:24 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://guba.eastmoney.com/list,of161823.html> (referer: http://fund.eastmoney.com/LOF_jzzzl.html)
```
**My `deny` settings didn't take effect.** Can anyone help?
Answer 0 (score: 0)
From the documentation:
> deny (a regular expression or list of regular expressions) – a single regular expression (or list of regular expressions) that the (absolute) URLs must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty), it won't exclude any links.
https://doc.scrapy.org/en/latest/topics/link-extractors.html#module-scrapy.linkextractors.lxmlhtml
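The documented precedence can be illustrated with a small standalone sketch. Note this is not Scrapy's actual implementation, just a plain-`re` model of the semantics described above (using `re.match`, as in the demonstration further below):

```python
import re

def link_allowed(url, allow=(), deny=()):
    """Model of the documented allow/deny semantics:
    deny has precedence over allow; an empty deny excludes nothing."""
    if any(re.match(pattern, url) for pattern in deny):
        return False  # a matching deny pattern always excludes the link
    if not allow:
        return True   # no allow patterns means everything else passes
    return any(re.match(pattern, url) for pattern in allow)

# A deny pattern with wildcards matches the full URL, so the link is excluded
print(link_allowed("http://guba.eastmoney.com/x.html", deny=(".*guba.*",)))
```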
This means that `deny` is a list of regular expressions that the URLs should match. The patterns defined in your code do not match the URLs you are crawling; if you want a partial match, you need to add wildcards at the beginning and end:
```python
$ ptpython
>>> import re
>>> url = "http://guba.eastmoney.com/list,of161823.html"
>>> re.match('guba', url)
>>> re.match('.+guba.+', url)
<re.Match object; span=(0, 44), match='http://guba.eastmoney.com/list,of161823.html'>
```
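Applied to the rule above, each deny pattern would need leading and trailing wildcards so it can match the full absolute URL. The list below is an illustrative adaptation of the original patterns, verified only against the sample URL from the log, not the live site:

```python
import re

# Original deny patterns wrapped in wildcards so they match full URLs
deny_patterns = (
    r'.*guba.*', r'.*f10.*', r'.*data.*',
    r'.*fund.*?\.eastmoney\.com/\d+\.html.*',
    r'.*quote.*', r'.*so\.eastmoney.*', r'.*life.*', r'.*/gonggao/.*',
)

url = "http://guba.eastmoney.com/list,of161823.html"
# At least one deny pattern now matches, so this link would be excluded
print(any(re.match(p, url) for p in deny_patterns))
```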