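I have a string representing an HTML select option group, and I want to use a regular expression in Python to remove the <select> element, keeping only the <option> and <optgroup> elements in my final string.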
<select id="id_permissions" multiple="" name="permissions">
<optgroup label="Auth">
<option value="4">Can view permission</option>
<option value="8">Can view group</option>
</optgroup>
</select>
How can I do this?
This regular expression does not work either, and I am hoping someone can point me in the right direction:
^(?=.*?\<select\b).*$
Answer 0 (score: 2)
Here, we can use a simple expression:
<select.+>\s*(<[\s\S]*>)\s*<\/select>
The output we want is captured in this group:
(<[\s\S]*>)
# coding=utf8
# the above tag defines encoding for this document and is for Python 2.x compatibility
import re
regex = r"<select.+>\s*(<[\s\S]*>)\s*<\/select>"
test_str = ("<select id=\"id_permissions\" multiple=\"\" name=\"permissions\">\n"
" <optgroup label=\"Auth\">\n"
" <option value=\"4\">Can view permission</option>\n"
" <option value=\"8\">Can view group</option>\n"
" </optgroup>\n"
"</select>")
subst = "\\1"
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)
# Note: for Python 2.7 compatibility, use ur"" to prefix the regex and u"" to prefix the test string and substitution.
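The expression above relies on greedy matching and a single <select> wrapper. As an alternative sketch (not part of the original answer), one could match only the opening and closing <select> tags and delete them, which leaves every <optgroup> and <option> in place. The names SELECT_TAG and strip_select_tags are hypothetical, and the pattern assumes no attribute value contains a literal ">".

```python
import re

# Matches an opening or closing <select> tag (plus trailing whitespace),
# never the elements nested inside it. Assumes no ">" inside attribute values.
SELECT_TAG = re.compile(r"</?select\b[^>]*>\s*", re.IGNORECASE)


def strip_select_tags(html: str) -> str:
    """Remove <select ...> and </select> tags, keeping everything between them."""
    return SELECT_TAG.sub("", html).strip()


html = (
    '<select id="id_permissions" multiple="" name="permissions">\n'
    '<optgroup label="Auth">\n'
    '<option value="4">Can view permission</option>\n'
    '</optgroup>\n'
    '<optgroup label="Other">\n'
    '<option value="9">Can view user</option>\n'
    '</optgroup>\n'
    '</select>'
)

print(strip_select_tags(html))
```

Because only the tags themselves are matched, this also works when the <select> contains several <optgroup> elements.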
Answer 1 (score: 2)
Why not use BeautifulSoup 4?
from bs4 import BeautifulSoup
s = """
<select id="id_permissions" multiple="" name="permissions">
<optgroup label="Auth">
<option value="4">Can view permission</option>
<option value="8">Can view group</option>
</optgroup>
</select>
"""
soup = BeautifulSoup(s, 'html.parser')
str(soup.find('optgroup'))
'<optgroup label="Auth">\n<option value="4">Can view permission</option>\n<option value="8">Can view group</option>\n</optgroup>'
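Note that find('optgroup') returns only the first <optgroup>. If BeautifulSoup is not available, or if the <select> may hold several groups, a similar parser-based approach can be sketched with the standard library's html.parser. This is a swapped-in technique, not the answer's own method; SelectStripper and strip_select are hypothetical names, and the sketch drops comments and declarations because it does not handle them.

```python
from html.parser import HTMLParser


class SelectStripper(HTMLParser):
    """Re-emit the markup, skipping only <select> start and end tags."""

    def __init__(self):
        # convert_charrefs=False so entities are passed through unchanged
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "select":
            # get_starttag_text() returns the tag as it appeared in the input
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are emitted once, as written
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag != "select":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        self.parts.append(f"&#{name};")


def strip_select(html):
    """Return html with the <select> wrapper removed and its children kept."""
    parser = SelectStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


s = """
<select id="id_permissions" multiple="" name="permissions">
<optgroup label="Auth">
<option value="4">Can view permission</option>
<option value="8">Can view group</option>
</optgroup>
</select>
"""

print(strip_select(s))
```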