Question

我在python中编写一个正则表达式来捕获SSI标记内的内容。

我想解析标签：

<!--#include file="/var/www/localhost/index.html" set="one" -->

进入以下组成部分：

标记功能（例如：include，echo或set）
在=符号
在"＆＃39;

问题在于我对如何抓取这些重复组感到茫然，因为名称/值对可能会在标记中出现一次或多次。我花了好几个小时。

这是我当前的正则表达式字符串：

^\<\!\-\-\#([a-z]+?)\s([a-z]*\=\".*\")+? \-\-\>$

它捕获第一组中的include和第二组中的file="/var/www/localhost/index.html" set="one"，但我所追求的是：

group 1: "include"
group 2: "file"
group 3: "/var/www/localhost/index.html"
group 4 (optional): "set"
group 5 (optional): "one"

(continue for every other name="value" pair)

I am using this site to develop my regex

Answer 1

抓住可以重复的所有内容，然后单独解析它们。这也可能是命名组的一个很好的用例！

import re

data = """<!--#include file="/var/www/localhost/index.html" set="one" reset="two" -->"""
pat = r'''^<!--#([a-z]+) ([a-z]+)="(.*?)" ((?:[a-z]+?=".+")+?) -->'''

result = re.match(pat, data)
result.groups()
('include', 'file', '/var/www/localhost/index.html', 'set="one" reset="two"')

然后迭代它：

g1, g2, g3, g4 = result.groups()
for keyvalue in g4.split(): # split on whitespace
    key, value = keyvalue.split('=')
    # do something with them

Answer 2

我建议不要使用单个正则表达式来捕获重复组中的每个项目。相反 - 不幸的是，我不懂Python，所以我用我理解的语言回答它，这是Java - 我建议首先提取所有属性，然后循环遍历每个项目，如下所示：

   import  java.util.regex.Pattern;
   import  java.util.regex.Matcher;
public class AllAttributesInTagWithRegexLoop  {
   public static final void main(String[] ignored)  {
      String input = "<!--#include file=\"/var/www/localhost/index.html\" set=\"one\" -->";

      Matcher m = Pattern.compile(
         "<!--#(include|echo|set) +(.*)-->").matcher(input);

      m.matches();

      String tagFunc = m.group(1);
      String allAttrs = m.group(2);

      System.out.println("Tag function: " + tagFunc);
      System.out.println("All attributes: " + allAttrs);

      m = Pattern.compile("(\\w+)=\"([^\"]+)\"").matcher(allAttrs);
      while(m.find())  {
         System.out.println("name=\"" + m.group(1) + 
            "\", value=\"" + m.group(2) + "\"");
      }
   }
}

输出：

Tag function: include
All attributes: file="/var/www/localhost/index.html" set="one"
name="file", value="/var/www/localhost/index.html"
name="set", value="one"

以下是可能感兴趣的答案：https://stackoverflow.com/a/23062553/2736496

请考虑将Stack Overflow Regular Expressions FAQ加入书签以供将来参考。

Answer 3

使用new python regex module的方式：

#!/usr/bin/python

import regex

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    (?>
        \G(?<!^)
      |
        <!-- \# (?<function> [a-z]+ )
    )
    \s+
    (?<key> [a-z]+ ) \s* = \s* " (?<val> [^"]* ) "
'''

matches = regex.finditer(p, s)

for m in matches:
    if m.group("function"):
        print ("function: " + m.group("function"))
    print (" key:   " + m.group("key") + "\n value: " + m.group("val") + "\n")

使用re模块的方式：

#!/usr/bin/python

import re

s = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'

p = r'''(?x)
    <!-- \# (?P<function> [a-z]+ )
    \s+
    (?P<params> (?: [a-z]+ \s* = \s* " [^"]* " \s*? )+ )
    -->
'''

matches = re.finditer(p, s)

for m in matches:
    print ("function: " + m.group("function"))
    for param in re.finditer(r'[a-z]+|"([^"]*)"', m.group("params")):
        if param.group(1):
            print (" value: " + param.group(1) + "\n")
        else:
            print (" key:   " + param.group())

Answer 4

不幸的是，python不允许使用递归正则表达式你可以这样做：

import re
string = '''<!--#include file="/var/www/localhost/index.html" set="one" set2="two" -->'''
regexString = '''<!--\#(?P<tag>\w+)\s(?P<name>\w+)="(?P<value>.*?")\s(?P<keyVal>.*)\s-->'''
regex = re.compile(regexString)
match = regex.match(string)
tag = match.group('tag')
name = match.group('name')
value = match.group('value')
keyVal = match.group('keyVal').split()
for item in keyVal:
    key, val in item.split('=')
    # You can now do whatever you want with the key=val pair

Answer 5

regex 库允许捕获重复的组（而内置的 re 不允许）。这提供了一个简单的解决方案，无需外部 for 循环来解析组。

import regex

string = r'<!--#include file="/var/www/localhost/index.html" set="one" -->'
rgx = regex.compile(
    r'<!--#(?<fun>[a-z]+)(\s+(?<key>[a-z]+)\s*=\s*"(?<val>[^"]*)")+')

match = rgx.match(string)
keys, values = match.captures('key', 'val')
print(match['fun'], *map(' = '.join, zip(keys, values)), sep='\n  ')

给你你想要的

include
  file = /var/www/localhost/index.html
  set = one

使用RegEx在Python中捕获重复组（参见示例）

5 个答案: