Parsing text inside caption tags in Python

Time: 2017-05-12 16:55:49

Tags: python regex

I am migrating a WordPress blog to Jekyll and have hit the following snag:

I want to parse text such as

[caption id="attachment_1749417" align="aligncenter" width="426"][![femur head cross section](http://www.wired.com/wp-content/uploads/2015/03/femur-head-cross-section.png)](http://www.bartleby.com/107/illus247.html) A cross-section of the top of the thigh bone. ![](http://www.wired.com/wp-content/themes/Phoenix/assets/images/gallery-cam@2x.png) [Gray's Anatomy](http://www.bartleby.com/107/illus247.html) / Public Domain[/caption]

so that I can recover all the text between the caption tags, i.e.

[![femur head cross section](http://www.wired.com/wp-content/uploads/2015/03/femur-head-cross-section.png)](http://www.bartleby.com/107/illus247.html) A cross-section of the top of the thigh bone. ![](http://www.wired.com/wp-content/themes/Phoenix/assets/images/gallery-cam@2x.png) [Gray's Anatomy](http://www.bartleby.com/107/illus247.html) / Public Domain

I tried the following Python code:

match = re.search("\[caption.*\](.*)\[\/caption\]",caption)
if match and len(match.groups()) > 0:
    actualcaption = match.groups()[0]
    print 'actual caption: '+ actualcaption

However, this only gives me (http://www.bartleby.com/107/illus247.html) / Public Domain

Any help would be greatly appreciated! Thanks.

1 Answer:

Answer 0: (score: 1)

The main issues are:

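  • You are accessing match.groups()[0]; match.group(1) is the more direct way to read the capture (both return the same value here). The pattern has exactly one pair of unescaped parentheses, which is its only capturing group, so its ID is 1.
  • You are using .* with a greedy quantifier, whereas you need .*? to match as few characters as possible (by default . matches any character except a line break). The greedy first .* runs up to the last ] that still allows a match, which is why only the tail of the caption was captured.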
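
To make the greedy versus lazy difference concrete, here is a minimal sketch. The sample string is a shortened, hypothetical caption (not the one from the question), and the variable names are only for illustration.

```python
import re

# Shortened, hypothetical caption; the URLs are placeholders.
sample = ('[caption id="x"][link](http://example.com) text '
          '[more](http://example.org) end[/caption]')

# Greedy: the first .* swallows everything up to the LAST "]" that still
# lets the rest of the pattern match, so the capture holds only the tail.
greedy = re.search(r"\[caption.*\](.*)\[/caption\]", sample)

# Lazy: .*? stops at the FIRST "]" of the opening tag, and (.*?) stops at
# the first "[/caption]".
lazy = re.search(r"\[caption.*?\](.*?)\[/caption\]", sample)

print(greedy.group(1))  # (http://example.org) end
print(lazy.group(1))    # [link](http://example.com) text [more](http://example.org) end

# groups()[0] and group(1) refer to the same capture.
print(lazy.groups()[0] == lazy.group(1))  # True
```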

Note: if the text spans multiple lines, you should also pass re.DOTALL (or its alias re.S) to re.search so that . can match line breaks.
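
As a rough illustration of that note, the sketch below runs the same lazy pattern with and without re.DOTALL. The multi-line caption is a made-up example, assuming the caption body contains a line break.

```python
import re

# Hypothetical caption whose body contains a line break.
multiline = '[caption id="x"]first line\nsecond line[/caption]'
pattern = r"\[caption.*?\](.*?)\[/caption\]"

# Without re.DOTALL, "." cannot cross the "\n", so there is no match.
print(re.search(pattern, multiline))  # None

# With re.DOTALL, "." also matches line breaks.
match = re.search(pattern, multiline, re.DOTALL)
print(repr(match.group(1)))  # 'first line\nsecond line'
```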

A Python demo:

import re
# Lazy quantifiers: stop at the first "]" of the opening tag and at the first "[/caption]"
regex = r"\[caption.*?](.*?)\[/caption]"
test_str = "[caption id=\"attachment_1749417\" align=\"aligncenter\" width=\"426\"][![femur head cross section](http://www.wired.com/wp-content/uploads/2015/03/femur-head-cross-section.png)](http://www.bartleby.com/107/illus247.html) A cross-section of the top of the thigh bone. ![](http://www.wired.com/wp-content/themes/Phoenix/assets/images/gallery-cam@2x.png) [Gray's Anatomy](http://www.bartleby.com/107/illus247.html) / Public Domain[/caption]"
match = re.search(regex, test_str)
if match:
    print(match.group(1))

Prints:

[![femur head cross section](http://www.wired.com/wp-content/uploads/2015/03/femur-head-cross-section.png)](http://www.bartleby.com/107/illus247.html) A cross-section of the top of the thigh bone. ![](http://www.wired.com/wp-content/themes/Phoenix/assets/images/gallery-cam@2x.png) [Gray's Anatomy](http://www.bartleby.com/107/illus247.html) / Public Domain