I asked a similar question yesterday, but I don't think I explained clearly what I was trying to do. I have the following code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests
class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Teams/32/"]
    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]
    def parse_item(self, response):
        stagematch = re.compile("data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{.*},",re.S)
        stagematch2 = re.search(stagematch, response.body)
        if stagematch2 is not None:
            stagematch3 = stagematch2.group(1)
            stageid = json.loads(stagematch3)
            stageid = stageid[0]['StageId']
            print stageid
With this, I am trying to parse a script on the page at this link, which has the following format:
data:{
    url: 'stage-player-stat'
},
defaultParams: {
    stageId: 9155,
    teamId: 32,
    playerId: -1,
    field: 2
},
From this, I want to extract the value of stageId, which in this example is 9155. However, this raises the following error:
stagematch3 = stagematch2.group(1)
exceptions.IndexError: no such group
I assume this is because the regex I am using is invalid, but I can't see the problem. Can anyone tell me where I am going wrong?
Thanks
Answer 0: (score: 1)
data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},
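Use this instead. Your original pattern contains no capturing group, which is why group(1) raises "no such group"; this version adds one around the first entry of defaultParams. See the demo.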
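As a rough sketch of how the pattern above could be used, here it is applied to the script snippet from the question. The helper name extract_stage_id and the SAMPLE string are my own, for illustration only. Note that the group captures the text "stageId: 9155", which is not valid JSON, so I am assuming a plain split on the colon rather than json.loads:

```python
import re

# Script snippet copied from the question, used here as test input.
SAMPLE = """
data:{
url: 'stage-player-stat'
},
defaultParams: {
stageId: 9155,
teamId: 32,
playerId: -1,
field: 2
},
"""

# The pattern from this answer. The lazy group (.*?) stops at the first comma,
# so it captures only the first entry of defaultParams.
STAGE_PATTERN = re.compile(
    r"data:\s*{\s*url:\s*'stage-player-stat'\s*},\s*defaultParams:\s*{\s*(.*?),.*},",
    re.S,
)


def extract_stage_id(text):
    """Hypothetical helper: return stageId as an int, or None if not found."""
    match = STAGE_PATTERN.search(text)
    if match is None:
        return None
    # group(1) looks like "stageId: 9155"; keep the part after the colon.
    _key, _sep, value = match.group(1).partition(":")
    return int(value)


if __name__ == "__main__":
    print(extract_stage_id(SAMPLE))
```

This relies on stageId being the first key inside defaultParams, as it is in the snippet you posted. If the order can vary, the pattern would need to name the key explicitly.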