系统地删除一段文本(正则表达式?)

时间:2012-08-10 20:34:23

标签: python regex

我的文字格式如下:

        <cast_member billing="top">
            <display_name>Elijah Wood</display_name>
            <character_name>#9 (voice)</character_name>
            <locales>
                <locale name="ko-KR">
                    <display_name>일라이자 우드</display_name>
                </locale>
                <locale name="cmn-Hant">
                    <display_name>伊利亞伍德</display_name>
                </locale>
            </locales>
        </cast_member>
        <cast_member billing="top">
            <display_name>Peter Pan</display_name>
            <character_name>#8 (voice)</character_name>
        </cast_member>

如果<locales>标记存在,我将如何删除 <cast_member billing="top"> <display_name>Elijah Wood</display_name> <character_name>#9 (voice)</character_name> </cast_member> <cast_member billing="top"> <display_name>Peter Pan</display_name> <character_name>#8 (voice)</character_name> </cast_member> 标记内的所有内容。上面的输入看起来像:

{{1}}

4 个答案:

答案 0 :(得分:1)

永远不要使用正则表达式来解析HTML或XML。请改用优秀的lxml库。

答案 1 :(得分:1)

这将在没有Regex的纯Python中完成,但它可能会破坏缩进和/或在文本被删除的地方留下空行

<cast_member billing="top">
    <display_name>Elijah Wood</display_name>
    <character_name>#9 (voice)</character_name>

</cast_member>
<cast_member billing="top">
    <display_name>Peter Pan</display_name>
    <character_name>#8 (voice)</character_name>
</cast_member>

这是代码:

with open('data') as f:
    text = f.read()

oTag = "<locales>"
cTag = "</locales>"

newText = ''
p = 0
s = text.find(oTag, p)
while s > -1:
    e = text.find(cTag, s)
    if e == -1:
        # ERROR: no closing tag
        pass
    newText += text[p:s]
    p = e + len(cTag)
    s = text.find(oTag, p)
newText += text[p:]

print newText,

答案 2 :(得分:0)

您可以使用正则表达式和正则表达式替换函数

“string”.replace(/ s /,'') - &gt; “特林”

你可以创建一个看起来像这样的正则表达式: /(\\+.+){0,}</locales>/ - &gt;这将匹配open和close语言环境标记,以及它们之间的任何内容。

http://rubular.com/r/WTfo0b2bet看到它的实际效果

myXMLstring.replace(/(\ s +。+){0,}&lt; / locales&gt; /,'')

答案 3 :(得分:0)

以下是我最终要做的事情,使用lxml:

cast_name = node.xpath("//package/video/cast/cast_member/display_name")
character_name = node.xpath("//package/video/cast/cast_member/character_name")
combined_cast = zip(cast_name, character_name)
cast = [(item1.text, item2.text) for item1, item2 in combined_cast]

[(Elijah Wood,#9 (voice)), (Peter Pan, #8 (voice))]