Question

我正在尝试通过Web在HTML源代码中的JavaScript标记内抓取一些数据。

情况：我可以找到相应的<script></script>标签。但是在该标签内，有一个很大的字符串，需要将其转换然后解析，这样我才能获得所需的精确数据。

问题是：我不知道该怎么做，也找不到一个明确而令人满意的答案。

代码如下：

我的目标是获取以下数据："xe7fd4c285496ab91"，这是内容的标识号，也称为"contentId"。

import requests
import bs4
import re

url = 'https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code'
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text,'html.parser') # by the way I am not sure if this is the right way to parse the link

item = soup.find(string=re.compile('contentId')) # with this line I can get directly to the exact javascript tag that I need

print(item) # but as you can see, it's a pretty big string, and I need to parse it to get the desired data. But you can find that the desired data "xe7fd4c285496ab91" is in it.

我尝试使用json.parse()，但不起作用：

import json
jsonparsed=json.parse(item)

出现此错误：

AttributeError: 'NavigableString' object has no attribute 'json'

我的问题是：如何获得所需的数据？有将字符串转换为javascript的函数，以便我可以解析它吗？还是将字符串转换为JSON文件的方法？

（请记住，我将在具有相似HTML / JavaScript的多个链接上执行此操作）。

Answer 1

您可以只对文本使用正则表达式，而无需搜索脚本

import re
import requests

r = requests.get('https://www.khanacademy.org/computing/computer-programming/programming/drawing-basics/pt/making-drawings-with-code')
p = re.compile(r'contentId":"((?:(?!").)*)')  
i = p.findall(r.text)[0]
print(i)

正则表达式

如何在Python中的html源代码中解析javascript代码？

1 个答案: