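I want to merge two XML files (they are actually TMX translation-memory files), but the two files may each contain a different version of the same entry. The file format looks like this:
file1.tmx: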
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
<header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="4.1.5_0_10418" segtype="paragraph" srclang="ZH-CN"/>
<body>
<tu>
<tuv lang="ZH-CN">
<seg>SourceEntryToBeTranslated1</seg>
</tuv>
<tuv lang="EN-US" changeid="user" changedate="20190119T074530Z">
<seg>I is John.</seg>
</tuv>
</tu>
<tu>
<tuv lang="ZH-CN">
<seg>SourceEntryToBeTranslated2</seg>
</tuv>
<tuv lang="EN-US" changeid="user" changedate="20190119T075550Z">
<seg>Other Entry 2</seg>
</tuv>
</tu>
</body>
</tmx>
file2.tmx:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
<header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="4.1.5_0_10418" segtype="paragraph" srclang="ZH-CN"/>
<body>
<tu>
<tuv lang="ZH-CN">
<seg>SourceEntryToBeTranslated1</seg>
</tuv>
<tuv lang="EN-US" changeid="user" changedate="20190415T064114Z">
<seg>I am John.</seg>
</tuv>
</tu>
<tu>
<tuv lang="ZH-CN">
<seg>SourceEntryToBeTranslated3</seg>
</tuv>
<tuv lang="EN-US" changeid="user" changedate="20190119T074550Z">
<seg>Other Entry 3</seg>
</tuv>
</tu>
</body>
</tmx>
Desired output:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
<header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="4.1.5_0_10418" segtype="paragraph" srclang="ZH-CN"/>
<body>
<tu>
<tuv lang="ZH-CN">
<seg>SourceEntryToBeTranslated1</seg>
</tuv>
<tuv lang="EN-US" changeid="user" changedate="20190415T064114Z">
<seg>I am John.</seg>
</tuv>
</tu>
<tu>
<tuv lang="ZH-CN">
<seg>SourceEntryToBeTranslated2</seg>
</tuv>
<tuv lang="EN-US" changeid="user" changedate="20190119T075550Z">
<seg>Other Entry 2</seg>
</tuv>
</tu>
<tu>
<tuv lang="ZH-CN">
<seg>SourceEntryToBeTranslated3</seg>
</tuv>
<tuv lang="EN-US" changeid="user" changedate="20190119T074550Z">
<seg>Other Entry 3</seg>
</tuv>
</tu>
</body>
</tmx>
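I can merge the files with xml.etree.ElementTree, but the result differs depending on the order in which I merge them. Note that the changedate attribute actually contains a date-time stamp.
I am not sure how to walk the element tree while merging so that the "latest" version of each entry is the one that is kept. My current attempt: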
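from xml.etree import ElementTree

def run(files):
    # Use the root of the first file as the merge target.
    first = None
    for filename in files:
        data = ElementTree.parse(filename).getroot()
        if first is None:
            first = data
        else:
            # Appends the children of every later root (<header>, <body>) as-is,
            # so duplicates are kept and the result depends on the file order.
            first.extend(data)
    if first is not None:
        print(ElementTree.tostring(first, encoding='UTF-8'))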
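One possible direction, sketched here only as an illustration: key every <tu> by its source-language segment, parse changedate into a datetime, and keep whichever <tu> carries the newest timestamp. The helper names (parse_changedate, tu_key_and_date, merge_tmx) and the SRC_LANG constant are hypothetical, and the sketch assumes each <tu> has one source <tuv> whose lang matches the header's srclang and that changedate always uses the YYYYMMDDTHHMMSSZ form shown in the samples.

```python
from datetime import datetime
from xml.etree import ElementTree

SRC_LANG = "ZH-CN"  # assumption: same value as srclang in the TMX header


def parse_changedate(value):
    """Parse a timestamp such as 20190415T064114Z into a datetime."""
    return datetime.strptime(value, "%Y%m%dT%H%M%SZ")


def tu_key_and_date(tu):
    """Return (source segment text, newest changedate) for one <tu>."""
    source = None
    newest = datetime.min  # used when no <tuv> carries a changedate
    for tuv in tu.findall("tuv"):
        if tuv.get("lang") == SRC_LANG:
            source = tuv.findtext("seg")
        elif tuv.get("changedate"):
            newest = max(newest, parse_changedate(tuv.get("changedate")))
    return source, newest


def merge_tmx(files):
    """Merge TMX files, keeping the most recently changed <tu> per source."""
    merged_root = None
    latest = {}  # source text -> (changedate, <tu> element)
    for filename in files:
        root = ElementTree.parse(filename).getroot()
        if merged_root is None:
            merged_root = root  # the first file's <header> is reused
        for tu in root.find("body").findall("tu"):
            source, date = tu_key_and_date(tu)
            # Replace the stored entry only when this one is strictly newer.
            if source not in latest or date > latest[source][0]:
                latest[source] = (date, tu)
    if merged_root is None:
        return None
    # Rebuild <body> from the winning entries.
    body = merged_root.find("body")
    for child in list(body):
        body.remove(child)
    body.extend([tu for _, tu in latest.values()])
    return merged_root
```

Calling merge_tmx(["file1.tmx", "file2.tmx"]) and passing the returned root to ElementTree.tostring should then pick the same version of each entry whichever file is listed first; only the order of the entries in the body may differ. ElementTree does not keep the DOCTYPE line when serializing, so it would have to be written back by hand if the target tool needs it.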