Getting a JS var value from HTML source code using BeautifulSoup in Python

Time: 2016-12-07 14:54:11

Tags: python beautifulsoup

I am trying to get the value of a JavaScript var from HTML source code using BeautifulSoup.

For example, I have:

<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>

I would like to get the value of the var "my" back in Python.

How can I do that?

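4 answers:

Answer 0: (score: 1)

The simplest approach is to use one regular expression for both steps: locating the script element through BeautifulSoup and extracting the desired substring from it: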

import re

from bs4 import BeautifulSoup

data = """
<script>
[other code]
var my = 'hello';
var name = 'hi';
var is = 'halo';
[other code]
</script>
"""

soup = BeautifulSoup(data, "html.parser")

pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

print(pattern.search(script.text).group(1))

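This prints hello.

Answer 1: (score: 1)

Another idea is to use a JavaScript parser: find the variable declaration node, check that its identifier has the desired name, and extract the initializer. An example using the slimit parser: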

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """
<script>
var my = 'hello';
var name = 'hi';
var is = 'halo';
</script>
"""

soup = BeautifulSoup(data, "html.parser")

script = soup.find("script", text=lambda text: text and "var my" in text)

# parse js
parser = Parser()
tree = parser.parse(script.text)
for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        print(node.initializer.value)

This prints the value of my.

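Answer 2: (score: 0)

The pattern in the answer above, pattern = re.compile(r"var my = '(.*?)';$", re.MULTILINE | re.DOTALL), takes the wrong approach: when re.MULTILINE and re.DOTALL are both set, the end-of-line anchor $ has to be removed.

Tried with Python 3.6.4.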
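
One way to avoid depending on the $ anchor at all is to match up to the closing quote and semicolon instead. The sketch below uses only the standard library and assumes the script text has already been pulled out of the page (for example with BeautifulSoup, as in the answers above). The helper name extract_js_string_var is hypothetical, and the pattern does not handle escaped quotes inside the string literal.

```python
import re


def extract_js_string_var(script_text, name):
    """Return the string literal assigned to `var <name>`, or None if absent.

    Hypothetical helper: accepts single or double quotes and flexible
    whitespace, and needs neither `$` nor any regex flags.
    """
    pattern = re.compile(
        r"\bvar\s+" + re.escape(name) + r"\s*=\s*(['\"])(.*?)\1\s*;"
    )
    match = pattern.search(script_text)
    return match.group(2) if match else None


# Sample script body, adapted from the question (one value double-quoted).
script_text = """
var my = 'hello';
var name = "hi";
var is = 'halo';
"""

print(extract_js_string_var(script_text, "my"))    # hello
print(extract_js_string_var(script_text, "name"))  # hi
print(extract_js_string_var(script_text, "nope"))  # None
```

Because the dot does not match newlines here, the lazy group cannot run past the end of the declaration's line.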

Answer 3: (score: 0)

Building on @alecxe's answer, but handling the more complex case of an array of dictionaries, i.e. an array of flat JSON objects:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<script>
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
</script>
"""

soup = BeautifulSoup(data, "html.parser")

script = soup.find("script", text=lambda text: text and "var my" in text)

# parse js
parser = Parser()
tree = parser.parse(script.text)
array_items = []
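for node in nodevisitor.visit(tree):
    if isinstance(node, ast.VarDecl) and node.identifier.value == 'my':
        # node.initializer is the array literal; each item is one object literal
        for item in node.initializer.items:
            # strip the surrounding quotes from string literals; numbers are left as-is
            parsed_dict = {
                getattr(n.left, 'value', '').strip("'\""): getattr(n.right, 'value', '').strip("'\"")
                for n in nodevisitor.visit(item)
                if isinstance(n, ast.Assign)
            }
            # append inside the loop so every object is kept, not just the last one
            array_items.append(parsed_dict)
print(array_items)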
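
If the array literal happens to be valid Python literal syntax as well (quoted keys, and no true, false or null), a lighter alternative to a full JavaScript parser is to cut the literal out with a regular expression and hand it to literal_eval from the standard library. This is a different technique from the slimit approach above, shown only as a minimal sketch under that assumption. It expects the script text to have been extracted already.

```python
import re
from ast import literal_eval  # Python's own ast module, not slimit.ast

# Sample script body, taken from the answer above.
script_text = """
var my = [{'dic1key1':1}, {'dic2key1':1}];
var name = 'hi';
var is = 'halo';
"""

# Capture everything from the opening bracket to the "]" that precedes ";".
match = re.search(r"\bvar\s+my\s*=\s*(\[.*?\])\s*;", script_text, re.DOTALL)

# literal_eval only evaluates literals, so it is safe on untrusted input,
# but it raises ValueError or SyntaxError on JS-only syntax.
array_items = literal_eval(match.group(1)) if match else []
print(array_items)  # [{'dic1key1': 1}, {'dic2key1': 1}]
```

Unlike the slimit version, the values keep their types here: the numbers come back as Python ints, not strings.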