使用Python

时间:2019-10-30 08:47:38

标签: python python-3.x web-scraping beautifulsoup

因此,我正在尝试从具有深层嵌套<script>标签的网站中获取特定数据。

使用import json希望使事情变得更轻松,这会导致著名的Expecting value: line 1 column 1 (char 0)错误。因此,我尝试了以下方法1,但成功率为零。

从本质上讲,连接到站点,捕获特定<script>标签的相对简单的步骤没有问题。从我需要的数据中获取数据似乎有问题。

假设以下元素:

script_tag = '''
<script id="startup" type="text/javascript">
$(document).ready(function () {createJsonChart({
"series":[{"name":"BNames","color":"#0043de","legendIndex":0,
"stack":null,
"data":[{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""},
{"name":"BNames","color":"#0043de","y":114.6,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},
{"name":"BNames","color":"#0043de","y":108.5,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},
{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,
"fillColor":null,"symbol":null,"radius":4},
"dashStyle":"Solid","lineWidth":2,
"step":"center","zIndex":"2","name":"Mandatory","color":"#f20808",
"legendIndex":0,"stack":1,
"data":[{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,"fillColor":null,
"symbol":null,"radius":4},"dashStyle":"Solid","lineWidth":2,
"step":"center", "zIndex":"2","name":"Preferred","color":"#38d615",
"legendIndex":0,"stack":2,
"data":[{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"}]}],
"resizeElement":null,"credits":{"enabled":false}});$('#__Page').lumnaInit('');});
</script>
'''

实际上,这个<script>标签更大。它包含3部分数据,分别在这里命名为BNamesMandatoryPreferred。我需要来自BNames的数据,特别是最后一个条目。因此,预期结果将来自部分"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},其中一个变量为BNames: 108,50 %,另一个变量为Month: september 2019

使用正则表达式的答案

url_part=soup.find("script", attrs={'id':'startup'}).text
info=re.findall(r'\s\w*\s\d*', url_part)[-1]
result=re.findall(r'(BNames: (\d+[,]\d+\s[%]))', url_part)[-1][1]

首先定义要使用的HTML标记。其次,找到所有大小为字母(\w*),后跟空格(\s)和任何大小数字(\d*)的所有实例。这将与2019年9月或2019年8月之类的任何内容匹配。最后,查找与BNames:匹配的实例,该实例中带有以下数字的设置:数字,逗号,数字,空格和百分号。因此(\d+[,]\d+\s[%]的确匹配从80.6%到120.05%的所有内容

1 个答案:

答案 0 :(得分:2)

Beleidsdekkingsgraad 字符串上使用以下正则表达式匹配。 BNames 的想法相同。

import re, requests

r = requests.get('https://www.pensioenfondstno.nl/overons/dekkingsgraad')
p = re.compile(r'"(Beleidsdekkingsgraad:[\s\S]*?)"', re.MULTILINE)
data = p.findall(r.text)[-1].split(' <br/> ')
print(data[0])
print(data[1])

正则表达式:

enter image description here