Regex that works with BaseSpider causes an error in CrawlSpider

Asked: 2014-08-01 19:08:40

Tags: python regex json scrapy

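I am using the Python.org build of Python 2.7 (64-bit) on Windows Vista 64-bit. The code below contains a regular expression for a JavaScript item called DataStore.prime, which I know is present on the static page I have been experimenting with using BaseSpider: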

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json


class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=('/Teams',)), follow=True, callback='parse_item')]

    def parse_item(self, response):


        playerdata = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                     + '(\[.*\])' + re.escape(");"), response.body).group(1)

        for player in json.loads(playerdata):
            print player['FirstName'], player['LastName'], player['TeamName'], player['PositionText'], player['PositionLong'] \
            , player['Age'] \
            , player['Height'], player['Weight'], player['GameStarted'], player['SubOn'], player['SubOff'] \
            , player['Goals'], player['OwnGoals'], player['Assists'], player['Yellow'], player['SecondYellow'], player['Red'] \
            , player['TotalShots'] \
            , player['ShotsOnTarget'], player['ShotsBlocked'], player['TotalPasses'], player['AccuratePasses'], player['KeyPasses'] \
            , player['TotalLongBalls'], player['AccurateLongBalls'], player['TotalThroughBalls'], player['AccurateThroughBalls'] \
            , player['AerialWon'], player['AerialLost'], player['TotalTackles'], player['Interceptions'], player['Fouls'] \
            , player['Offsides'], player['OffsidesWon'], player['TotalClearances'], player['WasDribbled'], player['Dribbles'] \
            , player['WasFouled'] \
            , player['Dispossesed'], player['Turnovers'], player['TotalCrosses'], player['AccurateCrosses'] \

execute(['scrapy','crawl','goal4'])

When this regex is used as part of a CrawlSpider, as in the example above, the code raises the following error:

 Traceback (most recent call last):
   File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
     self.runUntilCurrent()
   File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
     call.func(*call.args, **call.kw)
   File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
     self._startRunCallbacks(result)
   File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
     self._runCallbacks()
 --- <exception caught here> ---
   File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
     current.result = callback(current.result, *args, **kw)
   File "c:\Python27\missing\missing\spiders\mrcrawl2.py", line 26, in parse
     + '(\[.*\])' + re.escape(");"), response.body).group(1)
 exceptions.AttributeError: 'NoneType' object has no attribute 'group'

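I know the static page this example works on can be found here:

http://www.whoscored.com/Teams/705/Archive/Israel-Maccabi-Haifa

I assume the error above occurs when Scrapy tries to parse a page that contains no instance of DataStore.prime. Can anyone tell me:

1) whether this assumption is correct
2) how I can fix this. I tried using 'try:' and 'except:', but I am not sure how to write the equivalent of "if error, crawl next page".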

Thanks

1 Answer:

Answer 0: (score: 1)

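The problem comes from chaining the search and group method calls together. If search returns None, then None.group raises an AttributeError.

Instead, split the two method calls and check if match is not None. For example: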

def parse_item(self, response):

    match = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                 + '(\[.*\])' + re.escape(");"), response.body)
    if match is not None:
        playerdata = match.group(1)

        for player in json.loads(playerdata):
            ...
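
The same guard can also be pulled out into a small helper, which makes the "skip this page" behaviour easier to see and to try without running a crawl. The following is only a sketch: the names PLAYER_STATS_RE and extract_players are made up for illustration and are not part of Scrapy, and the pattern assumes, as in the question, that the JSON array sits on a single line.

```python
import json
import re

# Same pattern as in the question, compiled once.
# PLAYER_STATS_RE is a hypothetical name used only for this sketch.
PLAYER_STATS_RE = re.compile(
    re.escape("DataStore.prime('stage-player-stat', "
              "defaultTeamPlayerStatsConfigParams.defaultParams , ")
    + r"(\[.*\])"
    + re.escape(");")
)


def extract_players(body):
    """Return the decoded player list, or None if the page has no match.

    extract_players is a hypothetical helper, not a Scrapy API.
    """
    match = PLAYER_STATS_RE.search(body)
    if match is None:
        # No DataStore.prime call on this page: signal "nothing to parse".
        return None
    return json.loads(match.group(1))


# A page without the DataStore.prime call is skipped instead of raising.
players = extract_players("<html>no player stats on this page</html>")
if players is None:
    players = []
```

Inside parse_item, this would be called as extract_players(response.body), returning early when the result is None. As far as I know, a callback that returns without producing anything simply lets the crawl carry on with the next response, so no explicit "crawl next page" step should be needed.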