Python在xml文件中查找并替换问题

时间:2016-03-16 21:27:48

标签: python xml replace

我想通过添加一个特殊元素<question>来修改几个xml文件(语言语料库),以便能够更轻松地对表示问题的字符串进行操作。

以下是我拥有的xml文件示例:

<Turn speaker="spk2" startTime="4836.047" endTime="4840.004">
<Sync time="4836.047"/>
some text
<Sync time="4837.199"/>
first question ?
</Turn>
<Turn speaker="spk1" startTime="4840.004" endTime="4840.768">
<Sync time="4840.004"/>
text
<Event desc="rire" type="noise" extent="instantaneous"/>
</Turn>
<Turn speaker="spk2" startTime="4840.768" endTime="4846.534">
second question ?
<Sync time="4840.768"/>
third question? fourth question ? text
</Turn>

我想要的结果:

<Turn speaker="spk2" startTime="4836.047" endTime="4840.004"><question>
<Sync time="4836.047"/>
some text
<Sync time="4837.199"/>
first question ?</question>
</Turn>
<Turn speaker="spk1" startTime="4840.004" endTime="4840.768">
<Sync time="4840.004"/>
text
<Event desc="rire" type="noise" extent="instantaneous"/>
</Turn>
<Turn speaker="spk2" startTime="4840.768" endTime="4846.534"><question>
second question ?</question><question>
<Sync time="4840.768"/>
third question?</question><question> fourth question ?</question> text
</Turn>

基本上,它必须用?</question>替换每个问号,然后在文本中找到 另一个?</question> 元素<Turn>,然后在此处添加开头<question>

第一个元素还包含字符串&#34;一些文字&#34;,但这是我想要的,因为我无论如何都无法找到问题的开头。

我真的更喜欢使用python,因为之后我将不得不使用lxml库。而且我还希望保留原始文件中的换行符数量。

我试着用正则表达式做到这一点,但它看起来有点复杂,因为我必须考虑换行并且还要重叠,除了有几个组。我想出了以下正则表达式,但是抓得太多了:

(</question>|<Turn.*>)([\s\S]*</question>)

我也尝试过在字符串上使用for循环,但是对于python和编程来说是一种新的东西我总是无法实现我想要的东西。

1 个答案:

答案 0 :(得分:0)

import re

# Using re.split with grouping parens so the separators get returned in the split results
chunk = re.split(r'(<Turn.*?>|\?)', original_text)

# The "<Turn...>" and "?" separators are at the odd indexes. These are the
# places that '<question>' and '</question>' need to be inserted.
for i in range(1, len(chunk), 2):
    if chunk[i] == '?':
        chunk[i-2] += '<question>'
        chunk[i] += '</question>'

new_text = ''.join(chunk)

print(new_text)