Question

So, I am trying to extract specific data from a website that keeps it inside a deeply nested <script> tag.

I hoped that import json would make things easier, but it leads to the well-known Expecting value: line 1 column 1 (char 0) error. I therefore tried approach 1 below, with zero success.
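One plausible cause of that error, assuming the raw text of the <script> tag was passed straight to json.loads: the text is JavaScript, not JSON. It begins with $(document).ready( rather than { or [, so the parser fails on the very first character. A minimal sketch (script_text is a shortened, illustrative stand-in for the real tag content):

```python
import json

# Shortened, illustrative stand-in for the text of the <script> tag.
script_text = '$(document).ready(function () {createJsonChart({"series":[]});});'

try:
    json.loads(script_text)
    error_message = None
except json.JSONDecodeError as exc:
    # The JavaScript wrapper is not valid JSON, so parsing stops at char 0.
    error_message = str(exc)

print(error_message)
```

Even with the wrapper stripped, the function(){return false;} literals inside the events objects are not valid JSON either, so matching the tooltip strings with a regex is a pragmatic alternative.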

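Essentially, the relatively simple steps of connecting to the site and grabbing the specific <script> tag are not the problem. Extracting the data I need from that tag is.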

Assume the following element:

script_tag = '''
<script id="startup" type="text/javascript">
$(document).ready(function () {createJsonChart({
"series":[{"name":"BNames","color":"#0043de","legendIndex":0,
"stack":null,
"data":[{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""},
{"name":"BNames","color":"#0043de","y":114.6,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},
{"name":"BNames","color":"#0043de","y":108.5,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},
{"name":"BNames","color":"#0043de","y":0.0,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0","tooltip":""}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,
"fillColor":null,"symbol":null,"radius":4},
"dashStyle":"Solid","lineWidth":2,
"step":"center","zIndex":"2","name":"Mandatory","color":"#f20808",
"legendIndex":0,"stack":1,
"data":[{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"},
{"name":"Mandatory","color":"#f20808","y":104.1,
"legendIndex":0,
"events":{"click":function(){return false;}},
"subtotal":0.0,"displayValue":"0",
"tooltip":"Mandatory: 104,10 %"}]},
{"type":"line","marker":{"enabled":false,
"linecolor":null,"lineWidth":0,"fillColor":null,
"symbol":null,"radius":4},"dashStyle":"Solid","lineWidth":2,
"step":"center", "zIndex":"2","name":"Preferred","color":"#38d615",
"legendIndex":0,"stack":2,
"data":[{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"},
{"name":"Preferred","color":"#38d615","y":121.0,
"legendIndex":0,
"events":{"click":function(){return false;}},"subtotal":0.0,"displayValue":"0",
"tooltip":"Preferred: 121,00 %"}]}],
"resizeElement":null,"credits":{"enabled":false}});$('#__Page').lumnaInit('');});
</script>
'''

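In reality this <script> tag is much larger. It contains three data series, named here BNames, Mandatory and Preferred. I need the data from BNames, specifically the last entry. The expected result therefore comes from the fragment "tooltip":"BNames: 108,50 % <br/> Month: september 2019"}, with BNames: 108,50 % in one variable and Month: september 2019 in another.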

Answer using regular expressions

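import re

# `soup` is assumed to be the BeautifulSoup object built from the fetched page.
url_part = soup.find("script", attrs={'id': 'startup'}).text
info = re.findall(r'\s\w*\s\d*', url_part)[-1]  # last "<word> <digits>" run
result = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', url_part)[-1][1]  # percentage only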

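First, select the HTML tag to work with. Second, find every instance of a whitespace character (\s), followed by word characters of any length (\w*), another whitespace character (\s), and digits of any length (\d*). This matches things like september 2019 or august 2019. Finally, find the instances that match BNames: followed by this sequence: digits, a comma, digits, a space, and a percent sign. So (\d+[,]\d+\s[%]) matches everything from 80,6 % to 120,05 %.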
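The two patterns described above can be tried offline on a trimmed copy of the sample tag. This is only a sketch: sample is a hand-shortened excerpt standing in for the text that soup.find(...).text would return, and it assumes the real page uses the same tooltip format.

```python
import re

# Hand-trimmed excerpt of the sample <script> text (assumption: the real
# page uses the same tooltip format).
sample = (
    '"data":[{"name":"BNames","y":114.6,'
    '"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},'
    '{"name":"BNames","y":108.5,'
    '"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'
    '{"name":"BNames","y":0.0,"tooltip":""}]'
)

# whitespace, word characters, whitespace, digits -> last "<month> <year>" run
info = re.findall(r'\s\w*\s\d*', sample)[-1]

# findall returns (whole match, percentage) tuples; take the last percentage
result = re.findall(r'(BNames: (\d+[,]\d+\s[%]))', sample)[-1][1]

print(info.strip())  # september 2019
print(result)        # 108,50 %
```

Note that info keeps the leading space consumed by \s, hence the strip(). Because \w* and \d* may match nothing, the first pattern also matches any two adjacent whitespace characters, so taking [-1] only works while the month and year pair is the last such run in the text.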

Answer 1

Use the following regex to match on the Beleidsdekkingsgraad string. The same idea applies to BNames.

import re, requests

r = requests.get('https://www.pensioenfondstno.nl/overons/dekkingsgraad')
p = re.compile(r'"(Beleidsdekkingsgraad:[\s\S]*?)"', re.MULTILINE)
data = p.findall(r.text)[-1].split(' <br/> ')
print(data[0])
print(data[1])
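The same non-greedy pattern can be checked without a network call. A minimal sketch, assuming the tooltip format shown in the question; html is a made-up excerpt standing in for r.text, and BNames replaces Beleidsdekkingsgraad:

```python
import re

# Made-up excerpt standing in for r.text (assumption: same tooltip layout).
html = (
    '"tooltip":"BNames: 114,60 % <br/> Month: oktober 2018"},'
    '"tooltip":"BNames: 108,50 % <br/> Month: september 2019"},'
    '"tooltip":""}'
)

# Capture from "BNames:" up to the next double quote (non-greedy).
p = re.compile(r'"(BNames:[\s\S]*?)"')
data = p.findall(html)[-1].split(' <br/> ')

print(data[0])  # BNames: 108,50 %
print(data[1])  # Month: september 2019
```

re.MULTILINE is left out here because the pattern contains no ^ or $ anchors, so the flag has no effect on the match.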
