Question

Some items were missing from my Scrapy output file, so I manually added the missing pages as a third rule.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class KjvSpider(CrawlSpider):
    name = 'kjv'
    start_urls = ['file:///G:/OEBPS2/bible-toc.xhtml']

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), follow=True),      # 1st rule

        Rule(LinkExtractor(allow=r'\d\.xhtml$'),
             callback='parse_item', follow=False),             # 2nd rule
        Rule(LinkExtractor(allow=[r'2-jn\.xhtml$', r'jude\.xhtml$',
                                  r'obad\.xhtml$', r'philem\.xhtml$']),
             callback='parse_item', follow=False),             # 3rd rule
    )

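If I enable the 1st and 3rd rules (commenting out the 2nd rule), the four missing items are downloaded correctly, but the rest of the project (about 2,000 topics) is not.

However, if I enable all three rules, the missing items are still missing. (In other words, adding the 3rd rule makes no difference.)

I don't understand why the rule has no effect.

Any suggestions are welcome. Thanks in advance.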

Answer 1

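I found that I have to deny these missing URLs in the 1st rule so that, when the 3rd rule extracts them, they are not filtered out as duplicate requests. They are then fetched normally and passed to parse_item. This is consistent with Scrapy's documentation, which notes that when several rules match the same link, the first one defined in rules is used.

For example: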

rules = (
    # 1st rule: follow everything under OEBPS except the four missing pages
    Rule(LinkExtractor(allow=r'OEBPS',
                       deny=(r'2-jn\.xhtml$', r'jude\.xhtml$',
                             r'obad\.xhtml$', r'philem\.xhtml$')),
         follow=True),

    Rule(LinkExtractor(allow=r'\d\.xhtml$'),
         callback='parse_item', follow=False),              # 2nd rule
    Rule(LinkExtractor(allow=[r'2-jn\.xhtml$', r'jude\.xhtml$',
                              r'obad\.xhtml$', r'philem\.xhtml$']),
         callback='parse_item', follow=False),              # 3rd rule
)
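
To illustrate why the deny list matters, here is a minimal sketch in plain Python (no Scrapy required) that models only the "first matching rule wins" behavior for the four missing pages. It is a simplified model, not Scrapy's actual implementation: the helper claim_links, the rule labels, and the two sample URLs are hypothetical names made up for this illustration. Patterns are applied with re.search, and a deny match overrides an allow match.

```python
import re

# Patterns for the four pages that were missing from the output.
MISSING = [r'2-jn\.xhtml$', r'jude\.xhtml$', r'obad\.xhtml$', r'philem\.xhtml$']


def claim_links(urls, rules):
    """Map each URL to the label of the first rule that accepts it.

    Each rule is (label, allow_patterns, deny_patterns). A rule accepts a URL
    if any allow pattern matches and no deny pattern matches.
    """
    claimed = {}
    for url in urls:
        for label, allow, deny in rules:
            allowed = any(re.search(p, url) for p in allow)
            denied = any(re.search(p, url) for p in deny)
            if allowed and not denied:
                claimed[url] = label
                break  # later rules never see this URL
    return claimed


# Rules as in the question: the 1st rule has no deny list.
original_rules = [
    ('rule1_follow_only', [r'OEBPS'], []),
    ('rule3_parse_item', MISSING, []),
]

# Rules as in the answer: the 1st rule denies the missing pages.
fixed_rules = [
    ('rule1_follow_only', [r'OEBPS'], MISSING),
    ('rule3_parse_item', MISSING, []),
]

# Hypothetical URLs, assumed to sit next to bible-toc.xhtml.
urls = ['file:///G:/OEBPS2/jude.xhtml', 'file:///G:/OEBPS2/obad.xhtml']

before = claim_links(urls, original_rules)
after = claim_links(urls, fixed_rules)

for url in urls:
    print(url, before[url], '->', after[url])
```

In this model, both sample pages are claimed by the follow-only 1st rule until the deny list is added, after which they fall through to the 3rd rule and its parse_item callback.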
