Python regex to remove captions

Time: 2017-06-05 22:09:25

Tags: python regex

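I scraped the following text from a web page:

  A female believed to be around 25 years old, she was captured in the wild in Malaysian Borneo in 2011 and lived the rest of her life at the Tabin Wildlife Reserve in Sabah, a fenced-in facility managed by BORA. [caption id="attachment_194682" align="aligncenter" width="768"] Rhino being cared for at the Borneo rhino sanctuary. Photo courtesy of Sabah Wildlife Department.[/caption] Overall, between 55 and 100 of the Critically Endangered species are believed to remain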

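As you can see, there is text between the caption's square-bracket tags. Basically, I want to remove the bracketed tags and everything between them, including the sentence

  Rhino being cared for at the Borneo rhino sanctuary. Photo courtesy of Sabah Wildlife Department.

because it is the caption of an image. So the result should be: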

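  A female believed to be around 25 years old, she was captured in the wild in Malaysian Borneo in 2011 and lived the rest of her life at the Tabin Wildlife Reserve in Sabah, a fenced-in facility managed by BORA. Overall, between 55 and 100 of the Critically Endangered species are believed to remain...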

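How can I do this?

1 answer:

Answer 0: (score: 0)

You can use Python's re module to match the caption tags together with the text between them, and then remove the match, as follows: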

import re
text = """
A female believed to be around 25 years old, she was captured in the wild in Malaysian Borneo in 2011 and lived the rest of her life at the Tabin Wildlife Reserve in Sabah, a fenced-in facility managed by BORA. [caption id="attachment_194682" align="aligncenter" width="768"] Rhino being cared for at the Borneo rhino sanctuary. Photo courtesy of Sabah Wildlife Department.[/caption] Overall, between 55 and 100 of the Critically Endangered species are believed to remain
"""

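# Non-greedy quantifiers stop at the first closing tag, so text between
# two separate captions is not swallowed. re.sub also copes with input
# that has no caption at all (re.search would return None in that case).
pattern = r'\[caption.*?\].*?\[/caption\]'
print(re.sub(pattern, '', text, flags=re.DOTALL))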

#  A female believed to be around 25 years old, she was captured in the wild in Malaysian Borneo in 2011 and lived the rest of her life at the Tabin Wildlife Reserve in Sabah, a fenced-in facility managed by BORA.  Overall, between 55 and 100 of the Critically Endangered species are believed to remain
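
If the scraped page contains more than one caption, a greedy `.*` would match from the first opening tag to the last closing tag and delete the article text in between. The following is a minimal sketch of a reusable helper, not part of the original answer; the names `CAPTION_RE`, `strip_captions` and `sample` are made up for illustration, and it assumes the tags always appear as `[caption ...]` and `[/caption]` with no nesting.

```python
import re

# Opening [caption ...] tag, caption body, closing [/caption] tag.
# [^\]]* keeps the attribute match inside the opening tag, and .*? is
# non-greedy so each match ends at the nearest closing tag.
CAPTION_RE = re.compile(r'\[caption\b[^\]]*\].*?\[/caption\]', re.DOTALL)


def strip_captions(text):
    """Remove every caption block and tidy the whitespace left behind."""
    without_captions = CAPTION_RE.sub(' ', text)
    # Collapse runs of whitespace (including newlines) into single spaces.
    return re.sub(r'\s+', ' ', without_captions).strip()


# Hypothetical sample with two captions.
sample = (
    'First paragraph. [caption id="a" width="768"] Caption one.[/caption] '
    'Second paragraph. [caption id="b"] Caption two.[/caption] Third paragraph.'
)
print(strip_captions(sample))
# First paragraph. Second paragraph. Third paragraph.
```

Note that collapsing whitespace also flattens paragraph breaks; drop the second `re.sub` call if the original line structure needs to be preserved.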