Question

I am trying to get the value of a JavaScript var from HTML source code using BeautifulSoup.

For example, I have:

<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>

I would like to get the value of the var "my" back in Python.

How can I do this?

Answer 1

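The simplest approach is to use one regular expression pattern both to locate the element via BeautifulSoup and to extract the desired substring: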

import re

from bs4 import BeautifulSoup

data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""

soup = BeautifulSoup(data, "html.parser")

pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

print(pattern.search(script.text).group(1))

Prints hello.

Answer 2

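Another idea is to use a JavaScript parser: find the variable declaration node, check that the identifier has the desired name, and extract the initializer. Example using the slimit parser: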

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<script>
var my = 'hello';
var name = 'hi';
var is = 'halo';
</script>
"""

soup = BeautifulSoup(data, "html.parser")

script = soup.find("script", text=lambda text: text and "var my" in text)

# parse js
parser = Parser()
tree = parser.parse(script.text)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        print(node.initializer.value)

Prints the initializer of my.

Answer 3

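The pattern in the answer above, pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL), does not work as written: when re.MULTILINE and re.DOTALL are both set, the end-of-line anchor $ has to be removed.

Tested with Python 3.6.4.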
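
To illustrate the point about the anchor, here is a minimal sketch of a pattern that does not rely on $. The helper name find_js_string_var is hypothetical, and the sketch assumes the value is a plain quoted string with no escaped quotes inside it. It operates on the script text itself, so it would be applied to script.text from the code in Answer 1.

```python
import re


def find_js_string_var(script_text, name):
    """Return the string assigned to `var <name>`, or None if absent.

    Hypothetical helper: assumes a plain quoted literal without escaped quotes.
    """
    pattern = re.compile(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(['\"])(.*?)\1\s*;"
    )
    match = pattern.search(script_text)
    return match.group(2) if match else None


# Trailing spaces and Windows line endings would defeat a `';$` anchor,
# because `$` only matches immediately before "\n".
script_text = "var my = 'hello';  \r\nvar name = 'hi';\r\nvar is = 'halo';\r\n"
print(find_js_string_var(script_text, "my"))
```

This prints hello. Matching the closing quote with a backreference and allowing optional whitespace before the semicolon keeps the match on a single declaration without needing re.MULTILINE or re.DOTALL.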

Answer 4

Building on @alecxe's answer, but covering the more complex case of an array of dictionaries (a flat array of JSON objects):

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<script>
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
</script>
"""

soup = BeautifulSoup(data, "html.parser")

script = soup.find("script", text=lambda text: text and "var my" in text)

# parse js
parser = Parser()
tree = parser.parse(script.text)
array_items = []
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
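        for item in node.initializer.items:
            # Each property of an object literal is an ast.Assign node. slimit keeps
            # the quotes on string literals, so strip them from keys and values.
            parsed_dict = {getattr(n.left, 'value', '').strip("'\""): getattr(n.right, 'value', '').strip("'\"")
                for n in nodevisitor.visit(item)
                if isinstance(n, ast.Assign)}
            # Append inside the loop so every array element is collected.
            array_items.append(parsed_dict)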
print(array_items)
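
If slimit is not available, a lighter alternative (a different technique from the one in this answer) is to cut the array literal out with a regex and parse it with the standard library. This is only a sketch under strong assumptions: the JavaScript literal must also be valid Python literal syntax (quoted keys, no true, false, or null), and the helper name parse_js_array_var is hypothetical. Note that ast here is Python's standard-library module, not slimit.ast.

```python
import ast
import re


def parse_js_array_var(script_text, name):
    """Parse `var <name> = [...];` into Python objects, or return None.

    Hypothetical helper: only handles literals that are also valid Python.
    """
    pattern = re.compile(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(\[.*?\])\s*;", re.DOTALL
    )
    match = pattern.search(script_text)
    if match is None:
        return None
    # literal_eval only evaluates literals; it never executes code.
    return ast.literal_eval(match.group(1))


script_text = """
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
"""
print(parse_js_array_var(script_text, "my"))
```

Unlike the slimit version, numeric values come back as Python integers rather than strings. For anything beyond simple literals, a real JavaScript parser remains the safer choice.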

Get JS var value in HTML source using BeautifulSoup in Python
