Multiple regular expressions in LinkExtractor don't seem to work

Date: 2019-02-01 01:53:17

Tags: json regex scrapy web-crawler

I keep my regular expressions in a JSON file. That file is loaded as my spider's configuration, and the spider creates a LinkExtractor with allow and deny regex rules. I want to:

  • crawl and scrape product pages (the scraping/parsing does not work)
  • crawl category pages
  • avoid general pages (About Us, Privacy, etc.)

On some stores everything works fine, but on others it does not, and I think it is a problem with my regular expressions.

"rules": [
    {
        "deny": ["\\/(customer\\+service|ways\\+to\\+save|sponsorship|order|cart|company|specials|checkout|integration|blog|brand|account|sitemap|prefn1=)\\/"],
        "follow": false
    },
    {
        "allow": ["com\\/store\\/details\\/"],
        "follow": true,
        "use_content": true
    },
    {
        "allow": ["com\\/store\\/browse\\/"],
        "follow": true
    }
],

URL patterns:


Products:
  https://www.example.com/store/details/Nike+SB-Portmore-II-Solar-Canvas-Mens
  https://www.example.com/store/details/Coleman+Renegade-Mens-Hiking
  https://www.example.com/store/details/Mueller+ATF3-Ankle-Brace
  https://www.example.com/store/details/Planet%20Fitness+18
  https://www.example.com/store/details/Lifeline+Pro-Grip-Ring
  https://www.example.com/store/details/Nike+Phantom-Vision


Categories:
  https://www.example.com/store/browse/footwear/
  https://www.example.com/store/browse/apparel/
  https://www.example.com/store/browse/fitness/


Deny:
  https://www.example.com/store/customer+service/Online+Customer+Service
  https://www.example.com/store/checkout/
  https://www.example.com/store/ways+to+save/
  https://www.example.com/store/specials
  https://www.example.com/store/company/Privacy+Policy
  https://www.example.com/store/company/Terms+of+Service
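To rule the patterns out locally, a quick check along these lines can be run (a minimal sketch using the patterns from the JSON above and a few of the sample URLs; the variable names are illustrative and it is not part of the spider):

import re

# Patterns as they end up in Python after json.load() decodes the "\\/" escapes.
allow_details = re.compile(r"com\/store\/details\/")
allow_browse = re.compile(r"com\/store\/browse\/")
deny = re.compile(r"\/(customer\+service|ways\+to\+save|sponsorship|order|cart"
                  r"|company|specials|checkout|integration|blog|brand|account"
                  r"|sitemap|prefn1=)\/")

samples = [
    "https://www.example.com/store/details/Nike+SB-Portmore-II-Solar-Canvas-Mens",
    "https://www.example.com/store/browse/footwear/",
    "https://www.example.com/store/customer+service/Online+Customer+Service",
    "https://www.example.com/store/specials",
]

# Print which pattern matches each sample URL.
for url in samples:
    print(url)
    print("  allow details:", bool(allow_details.search(url)),
          "| allow browse:", bool(allow_browse.search(url)),
          "| deny:", bool(deny.search(url)))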

Loading the rules from JSON in my spider's __init__:

# Inside the spider's __init__; Rule comes from scrapy.spiders and
# LinkExtractor from scrapy.linkextractors.
for rule in self.MY_SETTINGS["rules"]:
    allow_r = ()
    if "allow" in rule.keys():
        allow_r = [a for a in rule["allow"]]

    deny_r = ()
    if "deny" in rule.keys():
        deny_r = [d for d in rule["deny"]]

    restrict_xpaths_r = ()
    if "restrict_xpaths" in rule.keys():
        restrict_xpaths_r = [rx for rx in rule["restrict_xpaths"]]

    Sportygenspider.rules.append(Rule(
        LinkExtractor(
            allow=allow_r,
            deny=deny_r,
            restrict_xpaths=restrict_xpaths_r,
        ),
        follow=rule["follow"],
        callback='parse_item' if ("use_content" in rule.keys()) else None
    ))
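For reference, a minimal self-contained spider built the same way might look like the sketch below. The spider name, start URL, settings file name and the parse_item body are illustrative placeholders, not taken from the original code:

import json

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class StoreSpider(CrawlSpider):
    name = "store"  # placeholder name
    start_urls = ["https://www.example.com/store/"]

    def __init__(self, settings_file="store_rules.json", *args, **kwargs):
        # Load a JSON config shaped like the one shown above.
        with open(settings_file) as f:
            self.MY_SETTINGS = json.load(f)

        rules = []
        for rule in self.MY_SETTINGS["rules"]:
            rules.append(Rule(
                LinkExtractor(
                    allow=rule.get("allow", ()),
                    deny=rule.get("deny", ()),
                    restrict_xpaths=rule.get("restrict_xpaths", ()),
                ),
                follow=rule["follow"],
                callback="parse_item" if rule.get("use_content") else None,
            ))

        # CrawlSpider.__init__ compiles self.rules, so they must be in place
        # before super().__init__ runs.
        self.rules = rules
        super().__init__(*args, **kwargs)

    def parse_item(self, response):
        # Placeholder parse logic.
        yield {"url": response.url, "title": response.css("title::text").get()}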

If I run pprint(vars(onerule.link_extractor)), I can see the compiled Python regexes correctly:

'deny_res': [re.compile('\\/(customer\\+service|sponsorship|order|cart|company|specials|checkout|integration|blog|account|sitemap|prefn1=)\\/')]

{'allow_domains': set(),
 'allow_res': [re.compile('com\\/store\\/details\\/')],

{'allow_domains': set(),
 'allow_res': [re.compile('com\\/store\\/browse\\/')],

Testing the regexes on https://regex101.com/ also looks fine (although I use \\/ in the JSON file and \/ on regex101.com).
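To double-check that the JSON escaping is not the problem, a small standalone snippet (illustrative, not part of the spider) can decode a pattern exactly the way json.load does and apply it:

import json
import re

# The JSON escape "\\/" decodes to the two characters "\/", i.e. the same
# pattern that gets pasted into regex101.com.
pattern = json.loads('"com\\\\/store\\\\/details\\\\/"')
print(pattern)  # -> com\/store\/details\/

url = "https://www.example.com/store/details/Mueller+ATF3-Ankle-Brace"
print(bool(re.search(pattern, url)))  # -> True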

In my spider's log file I can see that the product pages are being crawled, but not parsed:

2019-02-01 08:25:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/store/details/FILA+Hometown-Mens-Lifestyle-Shoes/5345120230028/_/A-6323521;> (referer: https://www.example.com/store/browse/footwear)  
2019-02-01 08:25:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.example.com/store/details/FILA+D-Formation-Mens-Lifestyle-Shoes/5345120230027/_/A-6323323> (ref

Why does the spider not parse the product pages? (It is the same code, used with different JSON files for different stores.)

1 Answer:

Answer 0 (score: 0)

After hours of debugging and testing, I found that I had to change the order of the rules:

  1. products to scrape
  2. deny (About Us, etc.)
  3. categories to follow

Now it works. When multiple rules match the same link, Scrapy uses the first one in the order they are defined, so the product rule has to come before the deny rule.

"rules": [
    {
        "allow": ["com\\/store\\/details\\/"],
        "follow": true,
        "use_content": true
    },
    {
        "deny": ["\\/(customer\\+service|ways\\+to\\+save|sponsorship|order|cart|company|specials|checkout|integration|blog|brand|account|sitemap|prefn1=)\\/"],
        "follow": false
    },
    {
        "allow": ["com\\/store\\/browse\\/"],
        "follow": true
    }
],