Scrapy: Extracting links

Asked: 2018-04-04 22:07:04

Tags: python web-scraping scrapy

I'm new to Scrapy and am trying to extract subtitles from www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014

This is my scrape.py, the spider file:

 from scrapy.spiders import CrawlSpider, Rule
 from scrapy.linkextractors import LinkExtractor
 from scrapy.selector import HtmlXPathSelector
 from scrapy.item import Item, Field
 import re

ss_base_url = "https://www.springfieldspringfield.co.uk/episode_scripts.php"

class Script(Item):
    url = Field()
    episode_name = Field()
    script = Field()

class SubtitleSpider(CrawlSpider):
    name = "scrape"
    allowed_domains = ['www.springfieldspringfield.co.uk']
    start_urls = [ss_base_url]
    rules = (
        Rule(LinkExtractor(allow=['/episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+']),
             callback="parse_script",
             follow=True),)

    def fix_field_names(self, field_name):
        field_name = re.sub(" ","_", field_name)
        field_name = re.sub(":","", field_name)
        return field_name

    def parse_script(self, response):
        x = HtmlXPathSelector(response)
        script = Script()
        script['url'] = response.url
        script['episode_name'] = "".join(x.select("//h3/text()").extract())
        script['script'] = "\n".join(x.select("//div[@class='episode_script']/text()").extract())
        return script

I'm trying to extract the subtitles for every season from https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014

The subtitles are found at links like these:

https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e01

https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e02

When I run

 scrapy crawl --nolog scrape

I expect to get the links above as output, but nothing is returned. Where am I going wrong?

1 Answer:

Answer 0 (score: 1)

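The regular expression you use to match links contains a question mark, which must be escaped for the match to work. Unescaped, `?` is a quantifier that makes the preceding `p` optional instead of matching a literal `?`. Note also that the episode links you listed point to view_episode_scripts.php, not episode_scripts.php. It should work if you change the pattern to this: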

'\/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+'
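The effect of the escaping can be checked outside Scrapy with the standard `re` module. This is a minimal sketch, assuming the link extractor applies each `allow` pattern as a regex search over the absolute URL; the helper `matches` and the variable names are hypothetical and exist only for this illustration.

```python
import re

# Same pattern twice; the only difference is whether "?" is escaped.
# Unescaped: "?" makes the preceding "p" optional, so "tv-show" would
# have to follow "ph" or "php" directly.
unescaped = r'\/view_episode_scripts\.php?tv-show=bojack-horseman-2014&episode=\w+'
# Escaped (the pattern from the answer): "\?" matches a literal "?".
escaped = r'\/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+'

# One of the episode URLs listed in the question.
url = ("https://www.springfieldspringfield.co.uk/view_episode_scripts.php"
       "?tv-show=bojack-horseman-2014&episode=s01e01")


def matches(pattern, candidate):
    """Hypothetical helper: True if the pattern is found anywhere in the URL."""
    return re.search(pattern, candidate) is not None


print(matches(unescaped, url))  # False
print(matches(escaped, url))    # True
```

Writing the pattern as a raw string (`r'...'`) keeps the backslashes intact, so `\?` and `\w` reach the regex engine unchanged.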

When you run the spider with --nolog, the links are not logged, so you need to remove that flag as well.