I keep my regular expressions in a JSON file. The file is loaded as the configuration for my spider. The spider creates a LinkExtractor with allow and deny regex rules.
I want:
It works fine on some shops but not on others, and I think the problem is in my regular expressions.
"rules": [
    {
        "deny": ["\\/(customer\\+service|ways\\+to\\+save|sponsorship|order|cart|company|specials|checkout|integration|blog|brand|account|sitemap|prefn1=)\\/"],
        "follow": false
    },
    {
        "allow": ["com\\/store\\/details\\/"],
        "follow": true,
        "use_content": true
    },
    {
        "allow": ["com\\/store\\/browse\\/"],
        "follow": true
    }
],
URL patterns:
Products:
https://www.example.com/store/details/Nike+SB-Portmore-II-Solar-Canvas-Mens
https://www.example.com/store/details/Coleman+Renegade-Mens-Hiking
https://www.example.com/store/details/Mueller+ATF3-Ankle-Brace
https://www.example.com/store/details/Planet%20Fitness+18
https://www.example.com/store/details/Lifeline+Pro-Grip-Ring
https://www.example.com/store/details/Nike+Phantom-Vision
Categories:
https://www.example.com/store/browse/footwear/
https://www.example.com/store/browse/apparel/
https://www.example.com/store/browse/fitness/
Deny:
https://www.example.com/store/customer+service/Online+Customer+Service
https://www.example.com/store/checkout/
https://www.example.com/store/ways+to+save/
https://www.example.com/store/specials
https://www.example.com/store/company/Privacy+Policy
https://www.example.com/store/company/Terms+of+Service
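One way to sanity-check the three patterns against the sample URLs outside the spider is sketched below. It assumes the patterns are applied with `re.search` semantics; `classify` is a hypothetical helper written for this illustration and is not part of Scrapy.

```python
import re

# Patterns copied from the rules above, written as they look after JSON
# decoding (each \\/ in the file becomes \/ in the Python string).
DENY = re.compile(
    r"\/(customer\+service|ways\+to\+save|sponsorship|order|cart|company"
    r"|specials|checkout|integration|blog|brand|account|sitemap|prefn1=)\/"
)
ALLOW_DETAILS = re.compile(r"com\/store\/details\/")
ALLOW_BROWSE = re.compile(r"com\/store\/browse\/")


def classify(url):
    """Report which of the three patterns are found anywhere in the URL."""
    return {
        "deny": bool(DENY.search(url)),
        "details": bool(ALLOW_DETAILS.search(url)),
        "browse": bool(ALLOW_BROWSE.search(url)),
    }


if __name__ == "__main__":
    for url in (
        "https://www.example.com/store/details/Mueller+ATF3-Ankle-Brace",
        "https://www.example.com/store/browse/footwear/",
        "https://www.example.com/store/checkout/",
        "https://www.example.com/store/specials",
    ):
        print(url, classify(url))
```

Under this check the product and category URLs match only their own allow pattern. The deny pattern requires a slash on both sides of the segment, so a URL such as https://www.example.com/store/specials, which has no trailing slash, would not be caught by it.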
In my spider's __init__, the rules are loaded from the JSON:
for rule in self.MY_SETTINGS["rules"]:
    allow_r = ()
    if "allow" in rule.keys():
        allow_r = [a for a in rule["allow"]]
    deny_r = ()
    if "deny" in rule.keys():
        deny_r = [d for d in rule["deny"]]
    restrict_xpaths_r = ()
    if "restrict_xpaths" in rule.keys():
        restrict_xpaths_r = [rx for rx in rule["restrict_xpaths"]]
    Sportygenspider.rules.append(Rule(
        LinkExtractor(
            allow=allow_r,
            deny=deny_r,
            restrict_xpaths=restrict_xpaths_r,
        ),
        follow=rule["follow"],
        callback='parse_item' if ("use_content" in rule.keys()) else None
    ))
If I run pprint(vars(onerule.link_extractor)), I can see that the Python regexes look correct:
'deny_res': [re.compile('\\/(customer\\+service|sponsorship|order|cart|company|specials|checkout|integration|blog|account|sitemap|prefn1=)\\/')]
{'allow_domains': set(),
'allow_res': [re.compile('com\\/store\\/details\\/')],
{'allow_domains': set(),
'allow_res': [re.compile('com\\/store\\/browse\\/')],
Testing the regexes at https://regex101.com/ also looks fine (although I use \\/ in the JSON file and \/ on regex101.com).
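The difference between \\/ in the JSON file and \/ on regex101.com can be checked with a minimal sketch like the one below. The `raw` string is a hypothetical one-rule fragment standing in for the real config file.

```python
import json
import re

# A JSON fragment as it would appear in the file: every backslash is doubled,
# because the backslash is the escape character inside JSON strings.
raw = r'{"allow": ["com\\/store\\/details\\/"]}'

# json.loads turns each \\ into a single backslash.
pattern = json.loads(raw)["allow"][0]
print(pattern)        # com\/store\/details\/
print(repr(pattern))  # repr doubles the backslashes again for display

# The decoded string is the same pattern one would type into regex101.
compiled = re.compile(pattern)
print(bool(compiled.search("https://www.example.com/store/details/Nike+Phantom-Vision")))
```

The doubled backslashes shown by pprint are the `repr` of the pattern string, so they do not by themselves indicate a problem with the compiled regex.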
In my spider's log file I can see that the product pages are being crawled but not parsed:
2019-02-01 08:25:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/store/details/FILA+Hometown-Mens-Lifestyle-Shoes/5345120230028/_/A-6323521;> (referer: https://www.example.com/store/browse/footwear)
2019-02-01 08:25:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/store/details/FILA+D-Formation-Mens-Lifestyle-Shoes/5345120230027/_/A-6323323> (ref
Why doesn't the spider parse the product pages? (The same code with a different JSON file is used for other shops.)
Answer 0 (score: 0)
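After hours of debugging and testing, I found that I had to change the order of the rules. This fits how CrawlSpider handles rules: if several rules match the same link, the first one in the list is used. A rule with only deny patterns and no allow patterns accepts every link that is not denied, so with the deny rule first, the product links were taken by that rule, which has no callback, and the pages were crawled but never parsed.
With the product rule first it works.
"rules": [
    {
        "allow": ["com\\/store\\/details\\/"],
        "follow": true,
        "use_content": true
    },
    {
        "deny": ["\\/(customer\\+service|ways\\+to\\+save|sponsorship|order|cart|company|specials|checkout|integration|blog|brand|account|sitemap|prefn1=)\\/"],
        "follow": false
    },
    {
        "allow": ["com\\/store\\/browse\\/"],
        "follow": true
    }
],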
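The effect of the rule order can be illustrated with a small pure-Python model. This is a simplified sketch, not Scrapy's implementation: it assumes a link goes to the first rule whose allow patterns match (an empty allow list matches everything) and whose deny patterns do not, and it ignores domain filters, extension filters and duplicate handling. `first_match` is a hypothetical helper.

```python
import re

DENY_RULE = {
    "deny": [
        r"\/(customer\+service|ways\+to\+save|sponsorship|order|cart|company"
        r"|specials|checkout|integration|blog|brand|account|sitemap|prefn1=)\/"
    ],
    "follow": False,
}
DETAILS_RULE = {"allow": [r"com\/store\/details\/"], "follow": True, "use_content": True}
BROWSE_RULE = {"allow": [r"com\/store\/browse\/"], "follow": True}

ORIGINAL_ORDER = [DENY_RULE, DETAILS_RULE, BROWSE_RULE]
FIXED_ORDER = [DETAILS_RULE, DENY_RULE, BROWSE_RULE]


def first_match(url, rules):
    """Return (rule index, callback name) for the first rule accepting the URL.

    Returns None when no rule accepts the URL.
    """
    for index, rule in enumerate(rules):
        allow = rule.get("allow", [])
        deny = rule.get("deny", [])
        allowed = not allow or any(re.search(p, url) for p in allow)
        denied = any(re.search(p, url) for p in deny)
        if allowed and not denied:
            callback = "parse_item" if "use_content" in rule else None
            return index, callback
    return None


if __name__ == "__main__":
    product = "https://www.example.com/store/details/Nike+Phantom-Vision"
    print(first_match(product, ORIGINAL_ORDER))  # (0, None)
    print(first_match(product, FIXED_ORDER))     # (0, 'parse_item')
```

In this model the product URL is claimed by the deny-only rule in the original order and therefore gets no callback, while in the reordered list it reaches the rule that sets `parse_item`. The same model also suggests that category URLs are claimed by the deny-only rule in both orders, so the third rule may never be reached; adding the browse pattern ahead of the deny-only rule, or adding the deny patterns to the allow rules, would be one way to check that.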