I'm using the Python.org build of Python 2.7 64-bit on Windows Vista 64-bit. I have the following code, which contains a regex for a JavaScript item named DataStore.prime that I know exists on the static page I've been experimenting with using BaseSpider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
class ExampleSpider(CrawlSpider):
    name = "goal4"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=('/Teams',)), follow=True, callback='parse_item')]

    def parse_item(self, response):
        playerdata = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
            + '(\[.*\])' + re.escape(");"), response.body).group(1)
        for player in json.loads(playerdata):
            print player['FirstName'], player['LastName'], player['TeamName'], player['PositionText'], player['PositionLong'], \
                player['Age'], player['Height'], player['Weight'], player['GameStarted'], player['SubOn'], player['SubOff'], \
                player['Goals'], player['OwnGoals'], player['Assists'], player['Yellow'], player['SecondYellow'], player['Red'], \
                player['TotalShots'], player['ShotsOnTarget'], player['ShotsBlocked'], player['TotalPasses'], \
                player['AccuratePasses'], player['KeyPasses'], player['TotalLongBalls'], player['AccurateLongBalls'], \
                player['TotalThroughBalls'], player['AccurateThroughBalls'], player['AerialWon'], player['AerialLost'], \
                player['TotalTackles'], player['Interceptions'], player['Fouls'], player['Offsides'], player['OffsidesWon'], \
                player['TotalClearances'], player['WasDribbled'], player['Dribbles'], player['WasFouled'], \
                player['Dispossesed'], player['Turnovers'], player['TotalCrosses'], player['AccurateCrosses']

execute(['scrapy', 'crawl', 'goal4'])
When this regex is used as part of a CrawlSpider, as in the example above, the code raises the following error:
Traceback (most recent call last):
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 1201, in mainLoop
self.runUntilCurrent()
File "c:\Python27\lib\site-packages\twisted\internet\base.py", line 824, in runUntilCurrent
call.func(*call.args, **call.kw)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 383, in callback
self._startRunCallbacks(result)
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 491, in _startRunCallbacks
self._runCallbacks()
--- <exception caught here> ---
File "c:\Python27\lib\site-packages\twisted\internet\defer.py", line 578, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "c:\Python27\missing\missing\spiders\mrcrawl2.py", line 26, in parse
+ '(\[.*\])' + re.escape(");"), response.body).group(1)
exceptions.AttributeError: 'NoneType' object has no attribute 'group'
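The failure can be reproduced outside Scrapy. When the pattern is absent from the page body, re.search returns None, and calling .group on None raises exactly this AttributeError. A minimal illustration of the chaining problem (Python 3 print syntax here for brevity; the spider above is Python 2):

```python
import re

# A page body with no DataStore.prime call: search() finds nothing.
body = "<html><body>No player data here</body></html>"

match = re.search(r"DataStore\.prime\(.*?(\[.*\]);", body)
print(match)  # None

# Chaining .group(1) directly onto re.search(...) therefore raises:
try:
    re.search(r"DataStore\.prime\(.*?(\[.*\]);", body).group(1)
except AttributeError as exc:
    print(exc)  # 'NoneType' object has no attribute 'group'
```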
The static page on which I know this example works can be found here:
http://www.whoscored.com/Teams/705/Archive/Israel-Maccabi-Haifa
I assume that if Scrapy tries to parse a page that does not contain an instance of DataStore.prime, it causes the error above. Can anyone tell me:
1) whether this assumption is correct, and 2) how I can get around the problem. I've tried using 'try:' and 'except:' blocks, but I'm not sure how to write the "if error, crawl the next page" part.
Thanks
Answer 0 (score: 1)
The problem comes from chaining the search and group method calls together. If search returns None, then None.group raises an AttributeError.
Instead, split the two calls apart and test the result with "if match is not None". For example:
def parse_item(self, response):
    match = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
        + '(\[.*\])' + re.escape(");"), response.body)
    if match is not None:
        playerdata = match.group(1)
        for player in json.loads(playerdata):
            ...
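As a self-contained sketch of the same pattern (hypothetical page bodies, not the live site; Python 3 print syntax), returning an empty list when there is no match means the callback simply yields nothing for pages without player data, so the crawl moves on to the next page without any try/except:

```python
import json
import re

# Same regex as the spider: escaped literal prefix, captured JSON array, ");".
PATTERN = (re.escape("DataStore.prime('stage-player-stat', "
                     "defaultTeamPlayerStatsConfigParams.defaultParams , ")
           + r'(\[.*\])' + re.escape(");"))

def extract_players(body):
    """Return the parsed player list, or an empty list when no match."""
    match = re.search(PATTERN, body)
    if match is None:
        return []  # page has no DataStore.prime call: nothing to do
    return json.loads(match.group(1))

with_data = ("DataStore.prime('stage-player-stat', "
             "defaultTeamPlayerStatsConfigParams.defaultParams , "
             '[{"FirstName": "Eden", "LastName": "Ben Basat"}]);')
without_data = "<html><body>No stats on this page</body></html>"

print(extract_players(with_data)[0]['LastName'])  # Ben Basat
print(extract_players(without_data))              # []
```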