Question

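I'm a real newbie with regular expressions. I tried to do this myself, but I couldn't work out from the manual how to approach it. I'm trying to find all img tags in a given piece of content. I wrote the code below, but it returns None: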

            content = i.content[0].value
            prog = re.compile(r'^<img')
            result = prog.match(content)
            print result

Any suggestions?

Answer 1

A general-purpose solution:

import re

image_re = re.compile(r"""
    (?P<img_tag><img)\s+    # tag starts
    [^>]*?                  # other attributes
    src=                    # start of src attribute
    (?P<quote>["'])?        # optional opening quote
    (?P<image>[^"'>]+)      # image file name
    (?(quote)(?P=quote))    # closing quote, required only if one was opened
    [^>]*?                  # other attributes
    >                       # end of tag
    """, re.IGNORECASE | re.VERBOSE)  # re.VERBOSE allows a readable, commented pattern

# content is the HTML string from the question
image_tags = []
for match in image_re.finditer(content):
    # group(0) is the whole matched tag; group("img_tag") alone is just "<img"
    image_tags.append(match.group(0))

# print the image tags that were found
for image_tag in image_tags:
    print(image_tag)

As you can see in the regex definition, it contains

(?P<group_name>regex)

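which lets you access the matched groups by group_name instead of by number. This is for readability. So, if you want to collect the src attribute of every img tag, write: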

for match in image_re.finditer(content):
    image_tags.append(match.group("image"))

The image_tags list will then contain the src values of the image tags.
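To see both kinds of result side by side, here is a minimal self-contained sketch that runs a pattern of the same shape against a small HTML string. The sample markup and file names are made up for illustration, and the names full_tags and sources are hypothetical.

```python
import re

# Pattern of the same shape as the one above; only the "quote" and "image"
# groups are named because those are the ones used here.
image_re = re.compile(r"""
    <img\s+                 # tag starts
    [^>]*?                  # other attributes
    src=                    # start of src attribute
    (?P<quote>["'])?        # optional opening quote
    (?P<image>[^"'>]+)      # image file name
    (?(quote)(?P=quote))    # closing quote, only if one was opened
    [^>]*?                  # other attributes
    >                       # end of tag
    """, re.IGNORECASE | re.VERBOSE)

# Hypothetical sample input
sample = '<p>Hi</p><img src="a.png" alt="A"><IMG class=x src=\'b.jpg\'/>'

# group(0) is the whole tag, group("image") is just the src value
full_tags = [m.group(0) for m in image_re.finditer(sample)]
sources = [m.group("image") for m in image_re.finditer(sample)]

print(full_tags)
print(sources)
```

Note that a pattern like this only finds img tags that carry a src attribute; tags without one are skipped.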

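Also, if you need to parse HTML, there are tools designed specifically for that purpose. One example is lxml, which lets you query the document with XPath expressions.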
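As a rough illustration of the parser-based route, the sketch below uses Python 3's standard-library html.parser instead of lxml, purely so that it runs without installing anything. It is a stand-in for the idea, not the lxml/XPath approach itself. The class name ImgCollector and the sample markup are hypothetical.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names;
        # attrs is a list of (name, value) pairs.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


# Hypothetical sample input
sample = '<div><img src="a.png" alt="A"><IMG SRC=b.jpg><p>text</p></div>'

collector = ImgCollector()
collector.feed(sample)
print(collector.sources)
```

A real parser copes with attribute order, quoting styles, and letter case without any extra work in the pattern, which is the main reason to prefer it over a regex for anything beyond quick extraction.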

Answer 2

I don't know Python, but assuming it uses ordinary Perl-compatible regular expressions...

You probably want to look for "<img[^>]+>", that is: "<img", followed by one or more characters that are not ">", followed by ">". Each match should give you a complete image tag.
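In Python, that pattern could be applied roughly as follows. This is a minimal sketch with a made-up sample string. It also shows why the attempt in the question prints None: re.match only tries to match at the very start of the string, whereas re.findall scans all of it.

```python
import re

# Hypothetical sample input; the img tags are not at the start of the string
sample = '<p>intro</p>\n<img src="a.png">\n<IMG alt="B" src="b.jpg">'

# match() is anchored at position 0, so it finds nothing here
print(re.match(r'<img', sample))

# findall() scans the whole string and returns every complete tag
tags = re.findall(r'<img[^>]+>', sample, re.IGNORECASE)
print(tags)
```

The re.IGNORECASE flag is an addition here so that upper-case IMG tags are picked up as well.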

Python: using a regular expression to match image tags in a large content string
