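Hi, I'm exporting a WordPress blog to another CMS. Before uploading to the new platform, I need to remove the opening and closing [caption] tags, including everything inside the brackets, from the HTML without removing the markup they enclose. The rest of the code is at https://github.com/thmcmahon/wp2nb.
Ideally, I'd like to implement this as a function like the following: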
def strip_caption_tags(content):
    # do_some_stuff_presumably_regex is a placeholder for the actual logic
    no_captions = do_some_stuff_presumably_regex(content)
    return no_captions
Here is an example of the data:
<![CDATA[[caption id="attachment_5582" align="alignleft" width="1024" caption="Out on Lake Burley Griffin with members of the Canberra Ice Dragons Paddle Club, January 2014"]<a href="http://www.andrewleigh.com/blog/wp-content/uploads/2014/01/ACT-Dragon-Boat-3.jpg"><img class="size-large wp-image-5582" title="ACT Dragon Boat 3" src="http://www.andrewleigh.com/blog/wp-content/uploads/2014/01/ACT-Dragon-Boat-3-1024x682.jpg" alt="" width="1024" height="682" /></a>[/caption]
<div class="mceTemp"><strong>Ca</strong><strong>l</strong><span style="font-weight: bold;">l for Local Sporting Champions to step up and apply for grants on offer</span></div>
Young people can find it difficult to meet the ongoing and significant costs associated with participation at sporting competitions.
The Local Sporting Champions program is designed to provide financial assistance for young people towards the cost of travel, accommodation, uniforms or equipment when competing, coaching or officiating at an official sports event.
For more information on the Local Sporting Champions program visit the Australian Sports Commission website: <a href="http://www.ausport.gov.au/champions">www.ausport.gov.au/champions</a>.]]>
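Answer 0 (score: 2)
This answers your question, though I'm not 100% sure you're asking the right question about converting your data. It would probably be easier to handle this before exporting the database to XML, but if you want to strip the tags with a regex in Python: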
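import re
# Placeholder string: load your real post contents here
contents = '[caption id="x"]<img src="a.jpg" />[/caption]'
# Remove both the opening [caption ...] and the closing [/caption] tags
contents = re.sub(r'\[/?caption[^\]]*?\]', '', contents)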
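Breaking down the regex:
\[ : matches a literal left square bracket [
/? : optionally matches a forward slash /
caption : matches the literal text caption
[^\]]*? : lazily matches any characters that are not a right square bracket ]
\] : matches a literal right square bracket ]
This matches [caption foo="bar"] as well as [/caption].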
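The breakdown above can also be written into the pattern itself. The sketch below uses re.VERBOSE so each piece carries its own comment; the name CAPTION_TAG and the sample string are made up for illustration and do not come from the question.

```python
import re

# Same pattern as in the answer, spread out with re.VERBOSE.
CAPTION_TAG = re.compile(
    r"""
    \[          # literal left square bracket
    /?          # optional forward slash (present only in the closing tag)
    caption     # the literal word caption
    [^\]]*?     # lazily, any characters that are not a right square bracket
    \]          # literal right square bracket
    """,
    re.VERBOSE,
)

# Hypothetical sample input, for illustration only.
sample = '[caption foo="bar"]<img src="a.jpg" />[/caption]'

print(CAPTION_TAG.findall(sample))   # the two tags that were matched
print(CAPTION_TAG.sub("", sample))   # the markup that remains
```

Both the opening and the closing tag are matched by the one pattern, so a single substitution removes them and leaves the enclosed markup alone.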
You can see it in action against your example, with further explanation, on Regex101.
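To fit the function shape asked for in the question, the same substitution can be wrapped as strip_caption_tags. This is a minimal sketch, not a full shortcode parser; the post string is a shortened version of the sample data, trimmed here for readability.

```python
import re


def strip_caption_tags(content):
    """Remove [caption ...] and [/caption] tags, keeping the markup between them."""
    return re.sub(r'\[/?caption[^\]]*?\]', '', content)


# Shortened version of the sample post from the question.
post = (
    '[caption id="attachment_5582" align="alignleft" width="1024" '
    'caption="Out on Lake Burley Griffin"]'
    '<a href="http://www.andrewleigh.com/blog/wp-content/uploads/2014/01/ACT-Dragon-Boat-3.jpg">'
    '<img class="size-large wp-image-5582" alt="" /></a>'
    '[/caption]'
)

print(strip_caption_tags(post))
```

One caveat: because the pattern stops at the first ], a caption attribute whose text itself contains a right square bracket would be cut short and leave residue behind. If your posts contain such captions, the pattern would need adjusting.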