I am writing a regular expression in Python to capture the contents of an SSI tag.
I want to parse the tag:
<!--#include file="/var/www/localhost/index.html" set="one" -->
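into the following components:
the tag function (such as include, echo or set)
the attribute name, found before the = sign
the attribute value, found between the " characters
The problem is that I am at a loss as to how to capture these repeating groups, because a name/value pair may occur one or more times in the tag. I have spent hours on this.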
This is my current regex string:
^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$
It captures include in the first group and file="/var/www/localhost/index.html" set="one" in the second group, but what I am after is:
group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"
(continue for every other name="value" pair)
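Answer 0 (score: 2)
Grab everything that can repeat as one chunk, then parse the pieces individually. This might also be a good use case for named groups!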
import re
data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''
result = re.match(pat, data)
result.groups()
('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')
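Then iterate over it:
g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split():  # split on whitespace
    key, value = keyvalue.split('=', 1)  # split on the first '=' only
    value = value.strip('"')  # drop the surrounding quotes
    # do something with them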
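As a follow-up to the remark about named groups, here is a minimal sketch of the same approach with names instead of positions. The group names (function, name, value, rest) and the slightly stricter pattern are illustrative choices, not part of the original answer.

```python
import re

data = '<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->'

# Same idea as above, but with named groups instead of positional ones.
# The group names (function, name, value, rest) are illustrative choices.
pat = re.compile(
    r'^<!--#(?P<function>[a-z]+) '
    r'(?P<name>[a-z]+)="(?P<value>[^"]*)"'
    r'(?P<rest>(?: [a-z]+="[^"]*")*) -->$'
)

m = pat.match(data)
parts = m.groupdict()

# Start with the first pair, then add the remaining ones.
attrs = {parts['name']: parts['value']}
for keyvalue in parts['rest'].split():
    key, value = keyvalue.split('=', 1)
    attrs[key] = value.strip('"')

print(parts['function'])
print(attrs)
```

Collecting the pairs into a dict makes later lookups such as attrs['file'] straightforward, at the cost of dropping any duplicate attribute names.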
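Answer 1 (score: 1)
I would advise against using a single regex to capture every item in a repeating group. Instead, I suggest extracting all the attributes first and then looping over each item. Unfortunately I don't know Python, so I am answering in the language I do know, Java: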
import java.util.regex.Pattern;
import java.util.regex.Matcher;

public class AllAttributesInTagWithRegexLoop {
    public static final void main(String[] ignored) {
        String input = "<!--#include file=\"/var/www/localhost/index.html\" set=\"one\" -->";
        Matcher m = Pattern.compile(
            "<!--#(include|echo|set) +(.*)-->").matcher(input);
        m.matches();
        String tagFunc = m.group(1);
        String allAttrs = m.group(2);
        System.out.println("Tag function: " + tagFunc);
        System.out.println("All attributes: " + allAttrs);
        m = Pattern.compile("(\\w+)=\"([^\"]+)\"").matcher(allAttrs);
        while(m.find()) {
            System.out.println("name=\"" + m.group(1) +
                "\", value=\"" + m.group(2) + "\"");
        }
    }
}
Output:
Tag function: include
All attributes: file="/var/www/localhost/index.html" set="one"
name="file", value="/var/www/localhost/index.html"
name="set", value="one"
Here is an answer that may be of interest: https://stackoverflow.com/a/23062553/2736496
Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference.
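Since the question is about Python, the two-step approach above could be ported roughly as follows. This is a sketch that reuses the same two patterns; the variable names are my own, and re.fullmatch stands in for Java's Matcher.matches().

```python
import re

tag = '<!--#include file="/var/www/localhost/index.html" set="one" -->'

# Step 1: split the tag into its function and the raw attribute string.
outer = re.fullmatch(r'<!--#(include|echo|set) +(.*)-->', tag)
tag_func, all_attrs = outer.groups()

# Step 2: find every name="value" pair inside the attribute string.
pairs = re.findall(r'(\w+)="([^"]+)"', all_attrs)

print("Tag function:", tag_func)
for name, value in pairs:
    print(f'name="{name}", value="{value}"')
```

Because re.findall returns one tuple per match when the pattern has two groups, no explicit group indexing is needed in the loop.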
Answer 2 (score: 1)
Using the new Python regex module:
#!/usr/bin/python
import regex
s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
p = r'''(?x)
(?>
\G(?<!^)
|
<!-- \# (?<function> [a-z]+ )
)
\s+
(?<key> [a-z]+ ) \s* = \s* " (?<val> [^"]* ) "
'''
matches = regex.finditer(p, s)
for m in matches:
    if m.group("function"):
        print("function: " + m.group("function"))
    print(" key: " + m.group("key") + "\n value: " + m.group("val") + "\n")
Using the re module:
#!/usr/bin/python
import re
s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
p = r'''(?x)
<!-- \# (?P<function> [a-z]+ )
\s+
(?P<params> (?: [a-z]+ \s* = \s* " [^"]* " \s*? )+ )
-->
'''
matches = re.finditer(p, s)
for m in matches:
    print("function: " + m.group("function"))
    for param in re.finditer(r'[a-z]+|"([^"]*)"', m.group("params")):
        if param.group(1) is not None:  # a quoted value (possibly empty)
            print(" value: " + param.group(1) + "\n")
        else:  # a bare attribute name
            print(" key: " + param.group())
Answer 3 (score: 0)
Unfortunately, Python's built-in re module does not support recursive regular expressions. You can do this instead:
import re
string = '''<!--#include file="/var/www/localhost/index.html" set="one" set2="two" -->'''
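# Raw string so the backslashes reach the regex engine unchanged;
# the closing quote sits outside the value group so it is not captured.
regexString = r'''<!--\#(?P<tag>\w+)\s(?P<name>\w+)="(?P<value>.*?)"\s(?P<keyVal>.*)\s-->'''
regex = re.compile(regexString)
match = regex.match(string)
tag = match.group('tag')
name = match.group('name')
value = match.group('value')
keyVal = match.group('keyVal').split()
for item in keyVal:
    key, val = item.split('=', 1)
    val = val.strip('"')  # drop the surrounding quotes
    # You can now do whatever you want with the key=val pair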
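One caveat with splitting the remaining attributes on whitespace is that a quoted value containing a space gets cut apart. The sketch below uses a made-up attribute string (the "my site" path is hypothetical) to compare the split with matching each name="value" pair directly.

```python
import re

# Hypothetical attribute string in which one value contains a space.
attrs = 'file="/var/www/my site/index.html" set="one"'

# Splitting on whitespace cuts the first value in two.
naive = attrs.split()

# Matching each name="value" pair keeps quoted values intact.
pairs = re.findall(r'(\w+)="([^"]*)"', attrs)

print(naive)
print(pairs)
```

If the tags being parsed never contain spaces inside values, the simpler split is fine.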
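Answer 4 (score: 0)
The regex library allows capturing repeated groups (whereas the built-in re does not). This gives a simple solution that needs no external for loop to parse the groups.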
import regex
string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = regex.compile(
r'<!--#(?<fun>[a-z]+)(\s+(?<key>[a-z]+)\s*=\s*"(?<val>[^"]*)")+')
match = rgx.match(string)
keys, values = match.captures('key', 'val')
print(match['fun'], *map(' = '.join, zip(keys, values)), sep='\n ')
which gives you what you want:
include
 file = /var/www/localhost/index.html
 set = one
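For comparison, here is a small sketch of what the built-in re module reports for an equivalent pattern. It is written with re's (?P<name>...) syntax and a non-capturing outer group, which are my adaptations. A group inside a repetition keeps only its last match, which is why the question's single-regex attempt cannot return every pair.

```python
import re

tag = '<!--#include file="/var/www/localhost/index.html" set="one" -->'

pattern = re.compile(
    r'<!--#(?P<fun>[a-z]+)(?:\s+(?P<key>[a-z]+)\s*=\s*"(?P<val>[^"]*)")+'
)
m = pattern.match(tag)

# With re, a group inside a repetition reports only its final match.
print(m.group('fun'), m.group('key'), m.group('val'))
```

Here the file pair is matched but overwritten by the set pair, so only the last key and value remain available on the match object.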