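I want to modify several XML files (language corpora) by adding a special element <question>, so that I can work more easily with the strings that represent questions.
Here is a sample of the XML files I have: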
<Turn speaker="spk2" startTime="4836.047" endTime="4840.004">
<Sync time="4836.047"/>
some text
<Sync time="4837.199"/>
first question ?
</Turn>
<Turn speaker="spk1" startTime="4840.004" endTime="4840.768">
<Sync time="4840.004"/>
text
<Event desc="rire" type="noise" extent="instantaneous"/>
</Turn>
<Turn speaker="spk2" startTime="4840.768" endTime="4846.534">
second question ?
<Sync time="4840.768"/>
third question? fourth question ? text
</Turn>
The result I want:
<Turn speaker="spk2" startTime="4836.047" endTime="4840.004"><question>
<Sync time="4836.047"/>
some text
<Sync time="4837.199"/>
first question ?</question>
</Turn>
<Turn speaker="spk1" startTime="4840.004" endTime="4840.768">
<Sync time="4840.004"/>
text
<Event desc="rire" type="noise" extent="instantaneous"/>
</Turn>
<Turn speaker="spk2" startTime="4840.768" endTime="4846.534"><question>
second question ?</question><question>
<Sync time="4840.768"/>
third question?</question><question> fourth question ?</question> text
</Turn>
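Basically, every question mark has to be replaced with ?</question>; then the preceding ?</question> or <Turn> element has to be located in the text, and the opening <question> added right after it.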
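The first element also contains the string "some text", but that is acceptable, because I have no way of finding where a question starts anyway.
I would really prefer to use Python, because I will have to use the lxml library afterwards. I also want to keep the same number of line breaks as in the original file.
I tried to do this with a regular expression, but it looks rather complicated: I have to account for line breaks and for overlapping matches, on top of handling several groups. I came up with the following regex, but it captures too much: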
(</question>|<Turn.*>)([\s\S]*</question>)
I also tried looping over the string with a for loop, but being new to Python and to programming, I never managed to get what I wanted.
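A likely reason the pattern above captures too much is that both `.*` and `[\s\S]*` are greedy, so a single match can run from the first <Turn> to the last </question> in the file. The sketch below shows this on a shortened, hypothetical sample in which "?" has already been replaced by "?</question>". The `lazy` pattern is only there for comparison: it limits each match, but it does not solve the overlap problem, since the </question> that ends one match cannot also start the next one.

```python
import re

# Shortened, hypothetical sample: "?" already replaced by "?</question>".
text = (
    '<Turn speaker="spk2">\n'
    'first question ?</question>\n'
    '</Turn>\n'
    '<Turn speaker="spk1">\n'
    'second question ?</question>\n'
    '</Turn>\n'
)

# The pattern from the question: greedy quantifiers.
greedy = re.compile(r'(</question>|<Turn.*>)([\s\S]*</question>)')
# Same pattern with lazy quantifiers, for comparison only.
lazy = re.compile(r'(</question>|<Turn.*?>)([\s\S]*?</question>)')

greedy_spans = [m.span() for m in greedy.finditer(text)]
lazy_spans = [m.span() for m in lazy.finditer(text)]

# The greedy pattern yields one match covering both turns;
# the lazy pattern yields one match per turn.
print(len(greedy_spans), len(lazy_spans))
```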
Answer 0: (score: 0)
import re

# original_text is assumed to hold the contents of one XML file as a string.
# Using re.split with grouping parens so the separators get returned in the split results.
chunk = re.split(r'(<Turn.*?>|\?)', original_text)

# The "<Turn...>" and "?" separators are at the odd indexes. These are the
# places that '<question>' and '</question>' need to be inserted.
for i in range(1, len(chunk), 2):
    # i >= 3 skips a leading "?" that has no earlier separator to attach '<question>' to.
    if chunk[i] == '?' and i >= 3:
        chunk[i-2] += '<question>'
        chunk[i] += '</question>'

new_text = ''.join(chunk)
print(new_text)
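The split-and-join approach above can be wrapped in a function and checked against a small sample. This is a minimal sketch, not a definitive implementation: `mark_questions` and `sample` are hypothetical names, and the `seen_turn` flag is an added assumption so that a "?" appearing before the first <Turn> (for example in an XML declaration) is left alone. The well-formedness check uses the standard library's `xml.etree.ElementTree` with a dummy <root> wrapper only to keep the example self-contained; lxml could be used the same way.

```python
import re
import xml.etree.ElementTree as ET


def mark_questions(text):
    """Wrap each question in <question>...</question> (hypothetical helper)."""
    # Odd indexes hold the separators: an opening <Turn ...> tag or a "?".
    chunks = re.split(r'(<Turn.*?>|\?)', text)
    seen_turn = False
    for i in range(1, len(chunks), 2):
        if chunks[i] != '?':
            seen_turn = True  # separator is an opening <Turn ...> tag
        elif seen_turn:
            # Open right after the previous separator, close right after "?".
            chunks[i - 2] += '<question>'
            chunks[i] += '</question>'
    return ''.join(chunks)


# Shortened sample based on the turns shown in the question.
sample = (
    '<Turn speaker="spk1">\n'
    'text\n'
    '</Turn>\n'
    '<Turn speaker="spk2">\n'
    'second question ?\n'
    '<Sync time="4840.768"/>\n'
    'third question? fourth question ? text\n'
    '</Turn>\n'
)
result = mark_questions(sample)
print(result)

# Sanity check: wrapped in a dummy root, the result should still parse as XML.
root = ET.fromstring('<root>' + result + '</root>')
print(len(root.findall('.//question')))
```

Because the tags are only appended to existing chunks and nothing is removed, the number of line breaks stays the same as in the input.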