Question

我有一些html文件，其中包含一段代码的两种不同模式，其中只有name="horizon"是常量。我需要获取名为“value”的属性的值。以下是示例文件： -
文件1：

<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>

文件2：

<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
    <option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
    <option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>

由于文件将包含其他标签和属性，我尝试通过引用this来编写正则表达式，以使用这些正则表达式从文件中过滤所需内容。

regex='^(?:.*?)(<(?P<TAG>\w+).+name\=\"horizon\"(?:.*[\n|\r\n?]*)+?<\/(?P=TAG>)'

我已尝试使用re.MULTILINE和re.DOTALL，但无法获得所需的文字我想，一旦我得到所需文本，我就可以使用re.findall('value\=\"(.*)\",text)找到所需的名称。
请建议是否有任何优雅的方法来处理这种情况。

Answer 1

我完全同意@ZiTAL说，将文件解析为XML会更快更好。

一些简单的代码行可以解决您的问题：

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()

# If you prefer to parse the text directly do root = ET.fromstring('<root>example</root>')

values = [el.attrib['value'] for el in root.findall('.//option')]

print(values)

Answer 2

试试这个正则表达式！

value="(.*)">

这是用于从html文件中提取值的简单正则表达式。这个正则表达式显示在双引号和＆amp;之间提取任何东西。在＆＃34; value =＆＃34;之后＆安培;之前＆＃34;＆gt;＆＃34;

我还附上了输出的截图！

Answer 3

我按照@kazbeel的解释尝试了xml.etree.ElementTree模块，但它给了我“错配标签”的错误，我发现在大多数情况下使用它都是这种情况。然后我找到了这个BeautifulSoup模块并使用它，它给出了期望的结果。以下代码涵盖了另一个文件模式以及上述问题 File3：

<input id="realm_90" type="hidden" name="horizon" value="RADIUS">

代码：

from bs4 import BeautifulSoup ## module for parsing xml/html files
def get_realms(html_text):
    realms=[]
    soup=BeautifulSoup(html_text, 'lxml')
    in_tag=soup.find(attrs={"name":"horizon"})
    if in_tag.name == 'select':
        for tag in in_tag.find_all():
            realms.append(tag.attrs['value'])
    elif in_tag.name == 'input':
        realms.append(in_tag.attrs['value'])
    return realms

我同意@ZiTAL在解析xml / html文件时不使用正则表达式，因为它太复杂了，并且存在number of libraries。

正则表达式匹配来自多行html文本的块

3 个答案: