My text is in the following format:
<cast_member billing="top">
<display_name>Elijah Wood</display_name>
<character_name>#9 (voice)</character_name>
<locales>
<locale name="ko-KR">
<display_name>일라이자 우드</display_name>
</locale>
<locale name="cmn-Hant">
<display_name>伊利亞伍德</display_name>
</locale>
</locales>
</cast_member>
<cast_member billing="top">
<display_name>Peter Pan</display_name>
<character_name>#8 (voice)</character_name>
</cast_member>
How would I remove everything inside the <locales> tag, if it exists? The above input would then look like:
<cast_member billing="top">
<display_name>Elijah Wood</display_name>
<character_name>#9 (voice)</character_name>
</cast_member>
<cast_member billing="top">
<display_name>Peter Pan</display_name>
<character_name>#8 (voice)</character_name>
</cast_member>
Answer 0 (score: 1)
Never use regular expressions to parse HTML or XML. Use the excellent lxml library instead.
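As a rough illustration of the parser-based approach, here is a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than lxml so that it runs without extra installs; lxml.etree offers a largely compatible API. The helper name strip_locales and the dummy <root> wrapper are my own assumptions, needed because the sample has several top-level elements.

```python
import xml.etree.ElementTree as ET


def strip_locales(fragment, tag="locales"):
    """Return the fragment with every <locales> element removed.

    Hypothetical helper: the fragment is wrapped in a dummy <root>
    because the sample input has more than one top-level element.
    """
    root = ET.fromstring("<root>" + fragment + "</root>")
    # Snapshot the elements first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent.findall(tag)):
            parent.remove(child)
    # Serialize the children only, dropping the dummy wrapper again.
    return "".join(ET.tostring(el, encoding="unicode") for el in root)
```

Calling strip_locales(text) on the sample input should leave both cast_member elements in place, minus the locales block. Whitespace that followed the removed element goes with it, so the indentation may shift slightly.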
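Answer 1 (score: 1)
This does it in pure Python without regex, though it may break the indentation and/or leave blank lines where text was removed. Output for the sample input: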
<cast_member billing="top">
<display_name>Elijah Wood</display_name>
<character_name>#9 (voice)</character_name>
</cast_member>
<cast_member billing="top">
<display_name>Peter Pan</display_name>
<character_name>#8 (voice)</character_name>
</cast_member>
Here is the code:
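with open('data') as f:
    text = f.read()

oTag = "<locales>"
cTag = "</locales>"
newText = ''
p = 0  # position just past the last removed block
s = text.find(oTag, p)
while s > -1:
    e = text.find(cTag, s)
    if e == -1:
        # ERROR: no closing tag; stop here and keep the remaining text as-is
        break
    newText += text[p:s]  # keep the text before the opening tag
    p = e + len(cTag)     # skip past the closing tag
    s = text.find(oTag, p)
newText += text[p:]  # append whatever follows the last removed block
print(newText, end='')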
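Answer 2 (score: 0)
You can use a regular expression together with a regex replace function:
"string".replace(/s/, '') -> "tring"
You can create a regular expression that looks like this: /<locales>(\s+.+){0,}<\/locales>/ -> this will match the opening and closing locales tags and anything between them.
See it in action at http://rubular.com/r/WTfo0b2bet
myXMLstring.replace(/<locales>(\s+.+){0,}<\/locales>/, '')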
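Since the question is about Python, the same idea could be sketched with re.sub. This is my own pattern, not the one from the answer above, and it assumes that <locales> carries no attributes and is never nested. The names LOCALES_RE and strip_locales_regex are hypothetical.

```python
import re

# Assumed pattern: optional indentation, the opening tag, the shortest
# possible run of anything (DOTALL lets "." cross newlines), the closing
# tag, then trailing spaces and at most one newline.
LOCALES_RE = re.compile(r"[ \t]*<locales>.*?</locales>[ \t]*\n?", re.DOTALL)


def strip_locales_regex(text):
    """Remove every <locales>...</locales> block, including its own line break."""
    return LOCALES_RE.sub("", text)
```

Swallowing the indentation and the trailing newline is meant to avoid leaving blank lines behind. The warning in answer 0 still applies: a regex like this breaks easily on markup it was not written for.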
Answer 3 (score: 0)
Here is what I ended up doing, using lxml:
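# node is an lxml element or tree that was parsed earlier (not shown here)
cast_name = node.xpath("//package/video/cast/cast_member/display_name")
character_name = node.xpath("//package/video/cast/cast_member/character_name")
# Pair each display name with its character name, in document order
combined_cast = zip(cast_name, character_name)
cast = [(item1.text, item2.text) for item1, item2 in combined_cast]
# cast is now: [('Elijah Wood', '#9 (voice)'), ('Peter Pan', '#8 (voice)')]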
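For anyone who wants to try this without a parsed node at hand, here is a self-contained sketch of the same extraction. It swaps lxml for the standard library's xml.etree.ElementTree, and it assumes the cast members sit under package/video/cast, as the XPath above suggests. The function name cast_pairs is hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample document; the package/video/cast wrapper is assumed from the XPath.
xml_text = """<package><video><cast>
<cast_member billing="top">
<display_name>Elijah Wood</display_name>
<character_name>#9 (voice)</character_name>
</cast_member>
<cast_member billing="top">
<display_name>Peter Pan</display_name>
<character_name>#8 (voice)</character_name>
</cast_member>
</cast></video></package>"""


def cast_pairs(xml_text):
    """Return (display_name, character_name) for each cast_member."""
    root = ET.fromstring(xml_text)
    # findtext looks at direct children only, so any localized
    # display_name nested under <locales> is ignored.
    return [
        (member.findtext("display_name"), member.findtext("character_name"))
        for member in root.iter("cast_member")
    ]


cast = cast_pairs(xml_text)
print(cast)
# prints [('Elijah Wood', '#9 (voice)'), ('Peter Pan', '#8 (voice)')]
```

Reading both fields from the same cast_member keeps the pairs aligned even when one member lacks a character_name, whereas zipping two separate result lists would silently shift them.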