Question

I have the following string, and I want to extract the elements (xx="yy") as well as the content between the brackets. Here is an example:

[caption id="get this" align="and this" width="and this" caption="and this"]this too please[/caption]

I tried the following code, but I'm a regex noob.

re.sub(r'\[caption id="(.*)" align="(.*)" width="(.*)" caption="(.*)"\](.*)\[\/caption\]', "tokens: %1 %2 %3 %4 %5", self.content, re.IGNORECASE)

Thanks a lot in advance!

Answer 1

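It probably isn't working for you because .* is greedy. Try [^"]* in its place. [^"] means the set of all characters except the quote character. Also, as you pointed out in the comments, the backreference syntax is \\n (\\1, \\2, and so on), not %n. Try this: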

# Pass flags by keyword: the fourth positional argument of re.sub is count, not flags.
re.sub(r'\[caption id="([^"]*)" align="([^"]*)" width="([^"]*)" caption="([^"]*)"\](.*)\[\/caption\]', "tokens: \\1 \\2 \\3 \\4 \\5", self.content, flags=re.IGNORECASE)
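To make the greediness point concrete, here is a minimal sketch that matches only the id attribute with both forms. The shortened pattern and the variable names are illustrative assumptions, not part of the original code.

```python
import re

text = ('[caption id="get this" align="and this" width="and this" '
        'caption="and this"]this too please[/caption]')

# Greedy: .* runs to the end of the line, then backtracks to the LAST quote.
greedy = re.search(r'id="(.*)"', text)

# Negated class: [^"]* cannot cross a quote, so it stops at the first closing quote.
negated = re.search(r'id="([^"]*)"', text)

print(greedy.group(1))   # spills over into the following attributes
print(negated.group(1))  # just the id value
```

With the greedy form the captured group swallows the following attributes as well, which is why the negated class is the safer choice for quoted values.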

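Does the content of the caption tag span multiple lines? If it does, .* won't capture the newlines. You would need to use something like [^\x00]* instead. [^\x00] means the set of all characters except the null character.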

re.sub(r'\[caption id="([^"]*)" align="([^"]*)" width="([^"]*)" caption="([^"]*)"\]([^\x00]*)\[\/caption\]', "tokens: \\1 \\2 \\3 \\4 \\5", self.content, flags=re.IGNORECASE)

If your string could legitimately contain null characters, you would need to use the re.DOTALL flag instead.
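A rough sketch of the re.DOTALL variant follows. The sample content and the lazy (.*?) for the body are assumptions added here so that several captions in one string would not be merged; they are not from the answer above.

```python
import re

# re.DOTALL lets "." match newlines; flags are combined with "|".
PATTERN = re.compile(
    r'\[caption id="([^"]*)" align="([^"]*)" width="([^"]*)" '
    r'caption="([^"]*)"\](.*?)\[/caption\]',
    re.IGNORECASE | re.DOTALL,
)

# Hypothetical input whose body spans two lines.
content = '[caption id="a" align="left" width="300" caption="c"]line one\nline two[/caption]'

# A raw replacement string lets us write \1 instead of \\1.
result = PATTERN.sub(r"tokens: \1 \2 \3 \4 \5", content)
print(result)
```

Compiling the pattern once also avoids the positional-argument pitfall, since the flags are given to re.compile rather than to re.sub.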

Answer 2

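You could leverage the power of Python's standard SGML/HTML/XML parsing modules: if it is safe to replace "[]" with "<>", you can do this substitution to produce valid XML and parse it with the standard library's XML parsing functions: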

from xml.etree import ElementTree as ET

text = '[caption id="get this" align="and this" width="and this" caption="and this"]this too please[/caption]'
# Python 3: str.maketrans and str.translate replace the old string-module functions
xml_text = text.translate(str.maketrans('[]', '<>'))  # Conversion to XML
parsed_text = ET.fromstring(xml_text)  # Parsing

# Extracted information
print("Text part:", parsed_text.text)
print("Values:", list(parsed_text.attrib.values()))

This correctly prints:

Text part: this too please
Values: ['get this', 'and this', 'and this', 'and this']

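The advantages of this approach are that (1) it uses a standard module that many people know; (2) it explicitly shows what you want to do; and (3) you can easily extract more information, handle more complicated values (including values that contain double quotes...), etc.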
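As a sketch of point (3), the parsed element also gives access to the tag name and to each attribute by name. The sample values below are made up for illustration, and the approach assumes the text contains no other "[", "]", "<" or "&" characters.

```python
from xml.etree import ElementTree as ET

# Hypothetical input with distinct attribute values.
text = '[caption id="get this" align="left" width="300" caption="a title"]this too please[/caption]'

# Turn the square brackets into angle brackets so the text parses as XML.
xml_text = text.translate(str.maketrans("[]", "<>"))
element = ET.fromstring(xml_text)

attributes = dict(element.attrib)  # attribute name -> value
print(element.tag)                 # name of the tag
print(attributes["id"])            # look up one attribute by name
print(element.get("width"))        # same lookup through the Element API
print(element.text)                # text between the opening and closing tags
```

Looking attributes up by name avoids depending on the order in which they appear in the tag.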

Answer 3

Could you try something like this?

# Ruby
text = '[caption id="get this" align="and this" width="and this" caption="and this"]this too please[/caption]'
# Print every name="value" pair found in the string
text.gsub(/([a-z]*)=\"(.*?)\"/i) do |m|
    puts "#{$1} = #{$2}"
end
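Since the question is about Python, a rough Python counterpart of the snippet above might look like this. It is a sketch using re.findall, with variable names chosen here for illustration.

```python
import re

text = ('[caption id="get this" align="and this" width="and this" '
        'caption="and this"]this too please[/caption]')

# Each match yields a (name, value) tuple; .*? stops at the first closing quote.
pairs = re.findall(r'([a-z]+)="(.*?)"', text, flags=re.IGNORECASE)

for name, value in pairs:
    print(f"{name} = {value}")
```

Unlike the fixed four-attribute pattern, this does not depend on which attributes are present or on their order, but it does not capture the text between the tags.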

Extracting elements within and between brackets
