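I have some Scrapy code that uses a regex to scrape a website. It searches some non-standard page source for the data I want, which is embedded in the page as a dictionary. When a match is found, the data is printed to the screen.
The table that shows this data to the user has several tabs. When the user moves between tabs, an XHR request refreshes the underlying data. The second part of my code tries to print the data that is returned when the user switches from the 'Overall' tab to the 'Home' tab on the following page: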
http://www.whoscored.com/Teams/32/
Here is the code:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time
import re
import json
import requests

class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com"]
    download_delay = 5

    rules = [Rule(SgmlLinkExtractor(allow=('\Teams'),deny=(),), follow=False, callback='parse_item')]

    def parse_item(self, response):
        match1 = re.search(re.escape("DataStore.prime('stage-player-stat', defaultTeamPlayerStatsConfigParams.defaultParams , ") \
                 + '(\[.*\])' + re.escape(");"), response.body) #regex to match initial data item

        if match1 is not None:
            playerdata1 = match1.group(1) #if match1 isn't empty then print the dictionary embedded in the source code of the page

            print '**********Players by team (Summary - Overall):**********'
            print '-' * 170
            for player in json.loads(playerdata1):
                print ("{TeamId},{PlayerId},{Name}".decode().format(**player))

        #submit xhr request to obtain the dictionary that contains the 'Home' data, rather than the 'Overall' data embedded in the source code.
        url = 'http://www.whoscored.com/stageplayerstatfeed'
        params = {
            'field': '1',
            'isAscending': 'false',
            'orderBy': 'Rating',
            'playerId': '-1',
            'stageId': '9155',
            'teamId': '32'
        }
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36',
                   'X-Requested-With': 'XMLHttpRequest',
                   'Host': 'www.whoscored.com',
                   'Referer': 'http://www.whoscored.com/Teams/32/'}

        response = requests.get(url, params=params, headers=headers)
        fixtures = response.json()

        print '**********Players by team (Summary - Home):**********'
        print '-' * 170
        for player in json.loads(fixtures): #print 'Home' dictionary here:
            print ("{TeamId},{PlayerId},{Name}".decode().format(**player))

execute(['scrapy','crawl','goal2'])
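This code throws an error saying that a string or buffer is expected. When I try to convert the variable 'fixtures' to a string before using it in the statement for player in json.loads(fixtures):
I get this error instead: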
  File "C:\Python27\lib\json\__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "C:\Python27\lib\json\decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "C:\Python27\lib\json\decoder.py", line 382, in raw_decode
    obj, end = self.scan_once(s, idx)
exceptions.ValueError: Expecting property name: line 1 column 3 (char 2)
I assume the error is related to the .decode().format(**player)) part of the statement, but I'm not sure what needs to change.
Can anyone help?
Thanks
Answer 0 (score: 1)
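You are trying to decode an object that has already been decoded. Decoding is exactly what response.json() takes care of for you.
Just loop over the fixtures list directly; there is no need to pass it to json.loads():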
for player in fixtures:
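To make the two errors from the question concrete, here is a small offline sketch. The sample record is made up for illustration, and json.loads(raw_body) stands in for response.json(), on the assumption that the latter simply parses the response text once. The try_loads helper is hypothetical and only exists to capture the exception. The code is written to run on both Python 2 and Python 3.

```python
from __future__ import print_function
import json

# Hypothetical stand-in for the text body of the XHR response.
raw_body = '[{"TeamId": 32, "PlayerId": 3859, "Name": "Wayne Rooney"}]'

# Parse the JSON text once; this plays the role of response.json().
fixtures = json.loads(raw_body)


def try_loads(value):
    """Return the exception raised by json.loads(value), or None."""
    try:
        json.loads(value)
    except (TypeError, ValueError) as exc:
        return exc
    return None


# A list is not JSON text, so json.loads rejects it with a TypeError
# (the "expected string or buffer" error on Python 2).
error_on_list = try_loads(fixtures)

# str() produces a Python repr with single-quoted keys, which is not
# valid JSON, so this fails with a ValueError ("Expecting property name").
error_on_str = try_loads(str(fixtures))

# Iterating over the already-decoded list works directly.
rows = ["{TeamId},{PlayerId},{Name}".format(**p) for p in fixtures]

print(type(error_on_list).__name__)
print(type(error_on_str).__name__)
print(rows)
```

The second failure is why converting fixtures to a string first only swaps one error for another: the string form of a Python list of dicts is not JSON.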
You can drop the .decode() call and use a u'...' unicode string literal instead:
print u"{TeamId},{PlayerId},{Name}".format(**player)
In Python 2, print is a statement, not a function, unless you put from __future__ import print_function at the top of the module.
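As a rough illustration of both points, the sketch below formats one record with a unicode template and prints it with the print function. The player dict is a hand-written sample (its values match one row of the output shown further down), and the non-ASCII characters are written as escapes so the file needs no encoding declaration. It should run unchanged on Python 2 and Python 3, where the u prefix is accepted but has no effect.

```python
from __future__ import print_function

# Hand-written sample record; the name contains non-ASCII characters,
# written here as escapes for "Angel Di Maria" with accents.
player = {"TeamId": 32, "PlayerId": 23110, "Name": u"\xc1ngel Di Mar\xeda"}

# A unicode template keeps the result unicode, so no .decode() is needed.
line = u"{TeamId},{PlayerId},{Name}".format(**player)

# With the __future__ import, print is a function on Python 2 as well.
print(line)
```

On Python 2, a plain byte-string template would try to encode the unicode name as ASCII during formatting, which is the situation the u'...' literal avoids.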
For your example URL, headers and parameters, this produces:
>>> fixtures = response.json()
>>> for player in fixtures:
...     print u"{TeamId},{PlayerId},{Name}".format(**player)
...
32,81726,Phil Jones
32,137795,Tyler Blackett
32,8166,Ashley Young
32,18296,Antonio Valencia
32,22079,Jonny Evans
32,23110,Ángel Di María
32,25363,Juan Mata
32,71345,Chris Smalling
32,5835,Darren Fletcher
32,107941,Michael Keane
32,79554,David de Gea
32,69956,Tom Cleverley
32,3859,Wayne Rooney
32,21723,Anderson
32,4564,Robin van Persie
32,39308,Danny Welbeck
32,130334,Adnan Januzaj