Question

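I'm trying to learn Scrapy, and I'm currently trying to crawl the BBC website.

I think I've set everything up correctly, but the rule only produces a single link. Here is the code: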

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BBCSpider(CrawlSpider):
    name = "bbc"
    allowed_domains = ["http://www.bbc.com"]
    start_urls = [
        "http://www.bbc.com/news/world",
    ]

    rules = [
        Rule(LinkExtractor(allow=r"http://www.bbc.com/news/world-.*"),
             callback='parse_item', follow=True)
    ]


    def parse_item(self, response):
        print(response)

Currently, only one link is produced (http://www.bbc.com/news/world-middle-east-33833400). I have no idea why. The regular expression matches many more links on the page.

Thanks a lot in advance.

Answer 1

Many of the links look like this (with relative URLs):

<a href="/news/world-middle-east-33833400" class="title-link">
    ...
</a>

Check only for news/world-.* instead:

rules = [
    Rule(LinkExtractor(allow=r"/news/world-.*"),
         callback='parse_item', follow=True)
]
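To illustrate why the path-only pattern is the safer choice, here is a minimal sketch that uses only Python's standard re and urllib.parse modules, not Scrapy itself. It assumes the pattern is applied as an unanchored regex search; the variable names are illustrative and not part of the original spider.

```python
import re
from urllib.parse import urljoin

# Page URL from the question and relative href from the HTML snippet above.
page_url = "http://www.bbc.com/news/world"
relative_href = "/news/world-middle-east-33833400"

# Resolve the relative href against the page URL, as a browser would.
absolute_url = urljoin(page_url, relative_href)

full_pattern = re.compile(r"http://www.bbc.com/news/world-.*")
path_pattern = re.compile(r"/news/world-.*")

# The path-only pattern matches both forms of the link.
path_matches_relative = path_pattern.search(relative_href) is not None
path_matches_absolute = path_pattern.search(absolute_url) is not None

# The full-URL pattern matches only after the link has been made absolute.
full_matches_relative = full_pattern.search(relative_href) is not None
full_matches_absolute = full_pattern.search(absolute_url) is not None
```

Because the path-only pattern matches in both cases, it does not depend on whether the link is seen in its relative or absolute form.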

Also, allowed_domains should contain only the domain, not a full URL:

allowed_domains = ["bbc.com"]
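The difference can be sketched with a small host check written with the standard library only. This roughly approximates the kind of comparison an offsite filter performs; is_allowed is a hypothetical helper for illustration, not a Scrapy API, and Scrapy's real implementation may differ in detail.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Hypothetical helper: True if the URL's host equals an allowed
    domain or is a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


link = "http://www.bbc.com/news/world-middle-east-33833400"

# The entry is a full URL, so it can never equal the host "www.bbc.com".
with_url_entry = is_allowed(link, ["http://www.bbc.com"])

# The entry is a bare domain, so "www.bbc.com" counts as a subdomain of it.
with_domain_entry = is_allowed(link, ["bbc.com"])
```

Under this assumption, an entry such as "http://www.bbc.com" rejects every link on the site, while "bbc.com" accepts them.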

Scrapy rules generate only a single link
