Scrapy: extracting data from a JavaScript script

Date: 2017-12-09 05:46:00

Tags: javascript python xpath scrapy web-crawler

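I am trying to extract the odds for games from the ESPN website. The 'moneyLine' odds are hidden inside a script, and I can't figure out how to access them. Ideally, I would end up with one line per game. I have already managed to extract the team names and the lines, and I would like the odds to go with them.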

scrapy shell
fetch('http://www.espn.com/nfl/schedule/_/week/1')
response.xpath("//script[contains(., 'moneyLine')]/text()")

Here is the output:

[<Selector xpath="//script[contains(., 'moneyLine')]/text()" data='\n\t\t\tvar espn = espn || {};\n\n\t\t\t// Build '>]

Below is a sample from the Firefox inspector window. I can see the 'moneyLine' items, I just can't isolate them.

1 answer:

Answer 0 (score: 2)

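Your data sits inside the <script> tag in JSON format, between data: and queue:.

You can cut this part out with standard string functions (e.g. find() and slicing). Then use the json module to convert it to a Python dictionary. After that, you only have to find where moneyLine sits in that dictionary.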

scrapy shell 'http://www.espn.com/nfl/schedule/_/week/1'

# get `<script>` as text
items = response.xpath("//script[contains(., 'moneyLine')]/text()")
txt = items.extract_first()

# find start and end of data
# (I found these offsets by manually checking txt)
start = txt.find('data:') + 6  # offset found manually: how much to add to get a valid JSON string
end = txt.find('queue:') - 6   # offset found manually: how much to subtract to get a valid JSON string

json_string = txt[start:end]

# convert to python dictionary
import json
data = json.loads(json_string)

# example data
# (I found this path manually using `data.keys()`, `data['sports'][0].keys()`, etc.)
data['sports'][0]['leagues'][0]['events'][0]['odds']['home']['moneyLine']
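
The hand-tuned +6 / -6 offsets above will break if the page's whitespace changes. As a rough sketch of an alternative, json.JSONDecoder().raw_decode() can parse the JSON object starting at a given index and stop by itself where the object ends. The helper names extract_json_after and home_money_lines, and the SAMPLE_TXT string, are made up for illustration. The sketch also assumes that the first '{' after data: opens the JSON object, and that every event follows the same sports/leagues/events/odds/home path as the example above.

```python
import json

# Hypothetical stand-in for the <script> text; the real page is much larger.
SAMPLE_TXT = '''
var espn = espn || {};
espn.scoreboardData = {
    data: {"sports": [{"leagues": [{"events": [
        {"odds": {"home": {"moneyLine": -150}}},
        {"odds": {"home": {"moneyLine": 120}}}
    ]}]}]},
    queue: []
};
'''


def extract_json_after(txt, marker):
    """Parse the first JSON object that follows `marker` in `txt`."""
    pos = txt.find(marker)
    if pos == -1:
        raise ValueError("marker %r not found" % marker)
    # Assumption: the first '{' after the marker opens the JSON object.
    start = txt.find('{', pos)
    if start == -1:
        raise ValueError("no JSON object found after %r" % marker)
    # raw_decode returns (object, index just past the object),
    # so no manually found end offset is needed.
    obj, _end = json.JSONDecoder().raw_decode(txt, start)
    return obj


def home_money_lines(data):
    """Collect the home moneyLine of every event (None where it is missing)."""
    lines = []
    for sport in data.get('sports', []):
        for league in sport.get('leagues', []):
            for event in league.get('events', []):
                odds = event.get('odds') or {}
                home = odds.get('home') or {}
                lines.append(home.get('moneyLine'))
    return lines


if __name__ == '__main__':
    data = extract_json_after(SAMPLE_TXT, 'data:')
    print(home_money_lines(data))
```

In the scrapy shell session above, the same helper could be called as data = extract_json_after(txt, 'data:') in place of the start/end slicing. Whether the real page meets these assumptions has to be checked against txt.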