Remove the <select> element with a regular expression

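Time: 2019-06-12 18:23:33

Tags: python regex python-3.x

I have a string representing a select option group in HTML. I want to use a regular expression in Python to remove the <select> element, so that only the <option> and <optgroup> elements remain in my final string.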

<select id="id_permissions" multiple="" name="permissions">
      <optgroup label="Auth">
          <option value="4">Can view permission</option>
          <option value="8">Can view group</option>
      </optgroup>
</select>

How can I do this?

I tried the following regex, but it does not work. I hope someone can point me in the right direction:

^(?=.*?\<select\b).*$

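2 answers:

Answer 0: (score: 2)

Here we can use a simple expression that matches the whole <select> element and captures everything between its opening and closing tags: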

<select.+>\s*(<[\s\S]*>)\s*<\/select>

The output we want is captured in this group:

(<[\s\S]*>)

Test

import re

# Match the whole <select> element and capture the markup between its tags.
# Note: ".+" does not cross newlines, so the opening tag must sit on one line.
regex = r"<select.+>\s*(<[\s\S]*>)\s*<\/select>"

test_str = ("<select id=\"id_permissions\" multiple=\"\" name=\"permissions\">\n"
    "      <optgroup label=\"Auth\">\n"
    "          <option value=\"4\">Can view permission</option>\n"
    "          <option value=\"8\">Can view group</option>\n"
    "      </optgroup>\n"
    "</select>")

# Replace the whole match with the captured inner markup
subst = "\\1"

# count=0 replaces every match; pass a positive count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0)

if result:
    print(result)
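
The expression above is greedy, so with more than one <select> in the string it would match from the first opening tag to the last closing tag. As a rough alternative sketch (not part of the original answer), you could remove only the tags themselves and leave everything between them untouched. The name `strip_select_tags` is made up for illustration, and the pattern assumes that no attribute value contains a literal ">" character:

```python
import re

# Matches an opening <select ...> or closing </select> tag plus any trailing whitespace.
# Assumption: no attribute value contains a literal ">" character.
SELECT_TAG = re.compile(r"</?select\b[^>]*>\s*", flags=re.IGNORECASE)


def strip_select_tags(html: str) -> str:
    """Remove <select> wrappers but keep everything nested inside them (hypothetical helper)."""
    return SELECT_TAG.sub("", html).strip()


html = (
    '<select id="id_permissions" multiple="" name="permissions">\n'
    '      <optgroup label="Auth">\n'
    '          <option value="4">Can view permission</option>\n'
    '          <option value="8">Can view group</option>\n'
    '      </optgroup>\n'
    '</select>'
)

cleaned = strip_select_tags(html)
print(cleaned)
```

Because only the tags are matched, this should also cope with several <select> elements in one string, although a real HTML parser is still the safer choice for arbitrary markup.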

Answer 1: (score: 2)

Why not use BeautifulSoup 4?

Code

from bs4 import BeautifulSoup
s = """
<select id="id_permissions" multiple="" name="permissions">
      <optgroup label="Auth">
          <option value="4">Can view permission</option>
          <option value="8">Can view group</option>
      </optgroup>
</select>
"""
soup = BeautifulSoup(s, 'html.parser')
# find() returns the first <optgroup> tag; str() serializes it without the enclosing <select>
str(soup.find('optgroup'))
'<optgroup label="Auth">\n<option value="4">Can view permission</option>\n<option value="8">Can view group</option>\n</optgroup>'
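
Note that `find('optgroup')` returns only the first <optgroup>, so a <select> with several groups or with bare <option> children would lose content. A possible variation (a sketch, not the answerer's code) is Beautiful Soup's `Tag.unwrap()`, which removes a tag but leaves its children in place. It assumes the `bs4` package is installed, and the variable names are illustrative:

```python
from bs4 import BeautifulSoup

html = """<select id="id_permissions" multiple="" name="permissions">
      <optgroup label="Auth">
          <option value="4">Can view permission</option>
          <option value="8">Can view group</option>
      </optgroup>
</select>"""

soup = BeautifulSoup(html, "html.parser")

# unwrap() removes the tag itself but keeps its children in the tree,
# so every <optgroup> and <option> survives, not just the first one.
for select in soup.find_all("select"):
    select.unwrap()

# Serialize the remaining tree and trim the whitespace left around it
result = str(soup).strip()
print(result)
```

The loop over `find_all("select")` means the same code should handle a string containing more than one <select> element.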