I am new to Scrapy and am trying to scrape the episode scripts from www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014
This is my scrape.py, which is the spider file:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
import re
ss_base_url = "https://www.springfieldspringfield.co.uk/episode_scripts.php"
class Script(Item):
    url = Field()
    episode_name = Field()
    script = Field()


class SubtitleSpider(CrawlSpider):
    name = "scrape"
    allowed_domains = ['www.springfieldspringfield.co.uk']
    start_urls = [ss_base_url]
    rules = (
        Rule(LinkExtractor(allow=['/episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+']),
             callback="parse_script",
             follow=True),)

    def fix_field_names(self, field_name):
        field_name = re.sub(" ", "_", field_name)
        field_name = re.sub(":", "", field_name)
        return field_name

    def parse_script(self, response):
        x = HtmlXPathSelector(response)
        script = Script()
        script['url'] = response.url
        script['episode_name'] = "".join(x.select("//h3/text()").extract())
        script['script'] = "\n".join(x.select("//div[@class='episode_script']/text()").extract())
        return script
I am trying to scrape the subtitles from https://www.springfieldspringfield.co.uk/episode_scripts.php?tv-show=bojack-horseman-2014
The subtitles are found at links like these:
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e01
https://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=s01e02
When I run
scrapy crawl --nolog scrape
I expect to get the links above as output, but nothing is returned. Where am I going wrong?
Answer 0 (score: 1)
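The regular expression used to match the links contains a question mark, which needs to be escaped for the match to work. Unescaped, the ? makes the preceding character optional instead of matching a literal question mark. It should work if you change the pattern to this: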
'\/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+'
When you run the spider with --nolog, the links are not logged, so you need to remove that flag as well.
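The escaping point can be checked without running a crawl. The following is a minimal sketch using only Python's re module, on the assumption that the allow patterns are searched against each extracted URL in the way re.search does. The name unescaped_variant and the helper matches are hypothetical and exist only for this illustration; the other two patterns are copied from the question and from this answer.

```python
import re

# One of the episode URLs listed in the question.
episode_url = ("https://www.springfieldspringfield.co.uk/view_episode_scripts.php"
               "?tv-show=bojack-horseman-2014&episode=s01e01")

# Pattern as written in the question.
question_pattern = r'/episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+'

# Hypothetical variant: correct path, but "?" still unescaped,
# so it means "optional p" rather than a literal question mark.
unescaped_variant = r'/view_episode_scripts.php?tv-show=bojack-horseman-2014&episode=\w+'

# Pattern suggested in this answer, with "?" and "." escaped.
answer_pattern = r'\/view_episode_scripts\.php\?tv-show=bojack-horseman-2014&episode=\w+'


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return re.search(pattern, url) is not None


print(matches(question_pattern, episode_url))   # False
print(matches(unescaped_variant, episode_url))  # False
print(matches(answer_pattern, episode_url))     # True
```

Note that the pattern in this answer also targets view_episode_scripts.php, which is the path used by the episode links, whereas the pattern in the question targets episode_scripts.php.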