I'm trying to get the value of a JavaScript var from HTML source using BeautifulSoup.
For example, I have:
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
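I want to get the value of the var "my" in Python.
How can I do this?
Answer 0: (score: 1)
The simplest approach is to use a regular expression both to locate the element via BeautifulSoup
and to extract the desired substring: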
import re
from bs4 import BeautifulSoup
data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""
soup = BeautifulSoup(data, "html.parser")
pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
print(pattern.search(script.text).group(1))
This prints hello.
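As a rough sketch of the same regex idea, the extraction step can be wrapped in a small helper that tolerates extra whitespace and either quote style. The helper name extract_js_string_var and the sample script text are made up for illustration; the helper only handles plain string literals without escaped quotes.

```python
import re


def extract_js_string_var(script_text, name):
    """Return the string literal assigned to `var <name>`, or None if absent.

    Hypothetical helper: covers simple string literals only.
    """
    pattern = re.compile(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(['\"])(.*?)\1\s*;"
    )
    match = pattern.search(script_text)
    # Group 1 is the opening quote, group 2 is the text between the quotes.
    return match.group(2) if match else None


# Made-up script body, similar to the one in the question.
script_text = """
var my = 'hello';
var name = "hi";
var is   =  'halo';
"""

print(extract_js_string_var(script_text, "my"))
```

With BeautifulSoup, script_text would be the text of the script element found by soup.find.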
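Answer 1: (score: 1)
Another idea is to use a JavaScript parser: find the variable declaration node, check that its identifier has the desired name, and extract the initializer. An example using the slimit
parser: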
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
<script>
var my = 'hello';
var name = 'hi';
var is = 'halo';
</script>
"""
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script", text=lambda text: text and "var my" in text)
# parse js
parser = Parser()
tree = parser.parse(script.text)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        print(node.initializer.value)
This prints the value assigned to my.
Answer 2: (score: 0)
The pattern in the answer above, pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL),
looks wrong to me: when re.MULTILINE and re.DOTALL are both set, the end-of-line anchor $
has to be removed.
Tested with Python 3.6.4.
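A small sketch of one case where the $ anchor matters. The sample strings are made up, and this may or may not be what the poster ran into. When something follows the semicolon on the same line, the anchored pattern cannot stop there, and because re.DOTALL lets . match newlines, the lazy group keeps growing until a later line happens to end in ';.

```python
import re

# Pattern from the first answer, with the end-of-line anchor.
anchored = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
# Same pattern without the anchor.
unanchored = re.compile(r"var my = '(.*?)';", re.MULTILINE | re.DOTALL)

# Made-up inputs: one clean, one with a trailing comment after the semicolon.
clean = "var my = 'hello';\nvar name = 'hi';\n"
trailing = "var my = 'hello';  // comment\nvar name = 'hi';\n"

clean_value = anchored.search(clean).group(1)
overrun_value = anchored.search(trailing).group(1)
fixed_value = unanchored.search(trailing).group(1)

print(repr(clean_value))    # the anchor is harmless on the clean input
print(repr(overrun_value))  # the capture spills over onto the next line
print(repr(fixed_value))    # without the anchor the capture stops at the first ';
```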
Answer 3: (score: 0)
Building on @alecxe's answer, but handling the more complex case of an array of dictionaries, i.e. an array of flat JSON objects:
from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor
data = """
<script>
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
</script>
"""
soup = BeautifulSoup(data, "html.parser")
script = soup.find("script", text=lambda text: text and "var my" in text)
# parse js
parser = Parser()
tree = parser.parse(script.text)
array_items = []
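for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        for item in node.initializer.items:
            # Each key/value pair of an object literal is an ast.Assign node.
            # strip() drops the quotes slimit keeps around string literals;
            # numeric values pass through unchanged (as strings).
            parsed_dict = {getattr(n.left, 'value', '').strip('\'"'): getattr(n.right, 'value', '').strip('\'"')
                           for n in nodevisitor.visit(item)
                           if isinstance(n, ast.Assign)}
            array_items.append(parsed_dict)
print(array_items)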
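For a literal as simple as the one above, a lighter alternative (a different technique from the slimit approach, shown only as a sketch) is to cut the array out with a regex and hand it to Python's standard-library ast.literal_eval. This relies on the assumption that the JavaScript literal also happens to be a valid Python literal: quoted keys, and no true, false, null or nested function calls. Note that the ast imported here is Python's own module, not slimit's ast.

```python
import ast  # Python's standard-library ast, not slimit.ast
import re

# Made-up script body matching the example above.
script_text = """
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
"""

# Capture everything from the opening bracket to the first "];".
match = re.search(r"var my\s*=\s*(\[.*?\]);", script_text, re.DOTALL)

# literal_eval only evaluates literals, so it does not execute arbitrary code.
items = ast.literal_eval(match.group(1))
print(items)
```

Here the numeric values come back as Python ints rather than strings.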