Question

I want to capture the values of HTML attributes with a Python regular expression. Currently I use:

re.compile( r'=(["\'].*?["\'])', re.IGNORECASE | re.DOTALL )

My problem is that I want the regular expression to "remember" whether the attribute value started with a single or a double quote.

I found a bug in my current approach with the following attribute:

href="javascript:foo('bar')"

My regular expression returns:

"javascript:foo('

Answer 1

You can capture the opening quote and then use a backreference to it:

r'=((["\']).*?\2)'
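
As a minimal sketch of how that backreference behaves, here it is applied to the attribute from the question. The names attr_value and sample are made up for illustration, and the second attribute is added only to show the single-quoted case.

```python
import re

# Group 2 captures the opening quote character; \2 then requires the
# very same character to close the value. Group 1 is the whole quoted value.
attr_value = re.compile(r'=((["\']).*?\2)', re.DOTALL)

# Illustrative input: one double-quoted and one single-quoted attribute.
sample = '''<a href="javascript:foo('bar')" title='say "hi"'>'''

values = [m.group(1) for m in attr_value.finditer(sample)]
for value in values:
    print(value)
```

The inner quotes no longer end the match, because only the character that opened the value is allowed to close it.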

However, regular expressions are not the proper approach to parsing HTML. You should consider using a DOM parser.
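
For comparison, a rough sketch of the parser route. It assumes the standard library's html.parser is acceptable; that module is an event-based parser rather than a true DOM, and it is swapped in here only because it needs no third-party install. AttrCollector is a hypothetical helper name.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper that records (tag, attribute, value) triples."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the surrounding quotes
        # have already been removed by the parser.
        for name, value in attrs:
            self.found.append((tag, name, value))


collector = AttrCollector()
collector.feed('''<a href="javascript:foo('bar')">link</a>''')
print(collector.found)
```

The parser deals with the quoting itself, so the question of which quote opened the value never comes up.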

Answer 2

The following should, in theory, be more efficient:

regex = r'"[^"]*"|\'[^\']*\''
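
A short sketch of that alternation in use. The single quote inside the character class is written as \' so that the raw string literal stays valid Python; quoted and sample are illustrative names. The likely reason it is cheaper is that each negated character class consumes the value in one pass, with no lazy quantifier to step through.

```python
import re

# Either a double-quoted run without double quotes inside,
# or a single-quoted run without single quotes inside.
quoted = re.compile(r'"[^"]*"|\'[^\']*\'')

# Illustrative input: one double-quoted and one single-quoted attribute.
sample = '''<a href="javascript:foo('bar')" title='say "hi"'>'''

matches = quoted.findall(sample)
for match in matches:
    print(match)
```

Note that this pattern matches any quoted string, not only one that follows an equals sign.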

For reference, here is Jeffrey Friedl's expression for matching HTML tags (from the owl book):

<              # Opening "<"
  (            #    Any amount of . . . 
     "[^"]*"   #      double-quoted string,
     |         #      or . . . 
     '[^']*'   #      single-quoted string,
     |         #      or . . . 
     [^'">]    #      "other stuff"
  )*           #
>              # Closing ">"

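The expression above can be tried in Python more or less as written by compiling it with re.VERBOSE, which ignores the layout whitespace and the # comments. This is only a sketch: the group is made non-capturing here (an adjustment, not part of the quoted expression), and tag and sample are illustrative names.

```python
import re

tag = re.compile(r"""
    <              # opening "<"
    (?:            #   any amount of ...
       "[^"]*"     #     double-quoted string,
       |           #     or ...
       '[^']*'     #     single-quoted string,
       |           #     or ...
       [^'">]      #     "other stuff"
    )*
    >              # closing ">"
""", re.VERBOSE)

# Illustrative input with ">" characters hidden inside quoted values.
sample = '''<p>text <a href="x>y" title='a>b'>link</a></p>'''

tags = [m.group(0) for m in tag.finditer(sample)]
for t in tags:
    print(t)
```

A ">" inside a quoted attribute value does not end the tag, because the quoted alternatives consume it first.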