Question

I asked a similar question yesterday, but I don't think I explained clearly what I am trying to do. I have the following code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import Selector
    from scrapy.item import Item
    from scrapy.spider import BaseSpider
    from scrapy import log
    from scrapy.cmdline import execute
    from scrapy.utils.markup import remove_tags
    import time
    import re
    import json
    import requests


    class ExampleSpider(CrawlSpider):
        name = "goal2"
        allowed_domains = ["whoscored.com"]
        start_urls = ["http://www.whoscored.com/Teams/32/"]

        rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

        def parse_item(self, response):

            stagematch = re.compile("data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{.*},",re.S)

            stagematch2 = re.search(stagematch, response.body)

            if stagematch2 is not None:
                stagematch3 = stagematch2.group(1)


                stageid = json.loads(stagematch3)
                stageid = stageid[0]['StageId']

                print stageid

With this, I am trying to parse a script on the page at this link, which has the following format:

    data:{
        url: 'stage-player-stat'
    },
    defaultParams: {
        stageId: 9155,
        teamId: 32,
        playerId: -1,
        field: 2
    },

From this, I want to extract the value of stageId, which is 9155 in this example. However, the code raises the following error:

stagematch3 = stagematch2.group(1)
    exceptions.IndexError: no such group

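I assume this is because the regex I am using is invalid, but I can't see what the problem is. Can anyone tell me where I am going wrong?

Thanks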

Answer 1

  data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},

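Use this pattern. Your regex has no capturing group, which is why group(1) raises IndexError. Here the parenthesized (.*?) captures the first entry of defaultParams, i.e. stageId: 9155. See the demo: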

http://regex101.com/r/iX5xR2/4
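
As a rough illustration of the point above, here is a minimal sketch run against the script fragment quoted in the question rather than the live page. The names SAMPLE, NO_GROUP, WITH_GROUP, STAGE_ID and extract_stage_id are made up for this example, and the braces are escaped, which the original patterns do not do. It also assumes a variant that captures only the digits, since the captured text stageId: 9155 is not valid JSON and json.loads would reject it.

```python
import re

# Script fragment copied from the question (not fetched from the site).
SAMPLE = """
data:{
    url: 'stage-player-stat'
},
defaultParams: {
    stageId: 9155,
    teamId: 32,
    playerId: -1,
    field: 2
},
"""

PREFIX = r"data:\s*\{\s*url:\s*'stage-player-stat'\s*\},\s*defaultParams:\s*\{"

# Question's pattern: it matches, but contains no parentheses, so no group 1.
NO_GROUP = re.compile(PREFIX + r".*\},", re.S)

# Answer's pattern: (.*?) lazily captures everything up to the first comma.
WITH_GROUP = re.compile(PREFIX + r"\s*(.*?),.*\},", re.S)

# Hypothetical variant: capture only the digits after "stageId:".
STAGE_ID = re.compile(PREFIX + r"\s*stageId:\s*(\d+)")


def extract_stage_id(text):
    """Return the stageId as an int, or None if the block is not found."""
    match = STAGE_ID.search(text)
    if match is None:
        return None
    return int(match.group(1))


try:
    NO_GROUP.search(SAMPLE).group(1)
except IndexError as exc:
    print("question's pattern:", exc)

print("answer's pattern:", WITH_GROUP.search(SAMPLE).group(1))
print("digits only:", extract_stage_id(SAMPLE))
```

In the spider, the same function could be called on the response body inside parse_item in place of the re.search and json.loads steps.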
