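Merging two XML files, using metadata (date) to resolve conflicts

Time: 2019-04-19 09:28:46

Tags: python xml python-3.x lxml xml.etree

I want to merge two XML files (actually TMX translation memory files), keeping in mind that the two files may each contain a different version of the same entry. The files look like this:

file1.tmx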

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
  <header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="4.1.5_0_10418" segtype="paragraph" srclang="ZH-CN"/>
  <body>
    <tu>
      <tuv lang="ZH-CN">
        <seg>SourceEntryToBeTranslated1</seg>
      </tuv>
      <tuv lang="EN-US" changeid="user" changedate="20190119T074530Z">
        <seg>I is John.</seg>
      </tuv>
    </tu>
    <tu>
      <tuv lang="ZH-CN">
        <seg>SourceEntryToBeTranslated2</seg>
      </tuv>
      <tuv lang="EN-US" changeid="user" changedate="20190119T075550Z">
        <seg>Other Entry 2</seg>
      </tuv>
    </tu>
  </body>
</tmx>

file2.tmx

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
  <header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="4.1.5_0_10418" segtype="paragraph" srclang="ZH-CN"/>
  <body>
    <tu>
      <tuv lang="ZH-CN">
        <seg>SourceEntryToBeTranslated1</seg>
      </tuv>
      <tuv lang="EN-US" changeid="user" changedate="20190415T064114Z">
        <seg>I am John.</seg>
      </tuv>
    </tu>
    <tu>
      <tuv lang="ZH-CN">
        <seg>SourceEntryToBeTranslated3</seg>
      </tuv>
      <tuv lang="EN-US" changeid="user" changedate="20190119T074550Z">
        <seg>Other Entry 3</seg>
      </tuv>
    </tu>
  </body>
</tmx>

Desired output:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE tmx SYSTEM "tmx11.dtd">
<tmx version="1.1">
  <header creationtool="OmegaT" o-tmf="OmegaT TMX" adminlang="EN-US" datatype="plaintext" creationtoolversion="4.1.5_0_10418" segtype="paragraph" srclang="ZH-CN"/>
  <body>
    <tu>
      <tuv lang="ZH-CN">
        <seg>SourceEntryToBeTranslated1</seg>
      </tuv>
      <tuv lang="EN-US" changeid="user" changedate="20190415T064114Z">
        <seg>I am John.</seg>
      </tuv>
    </tu>
    <tu>
      <tuv lang="ZH-CN">
        <seg>SourceEntryToBeTranslated2</seg>
      </tuv>
      <tuv lang="EN-US" changeid="user" changedate="20190119T075550Z">
        <seg>Other Entry 2</seg>
      </tuv>
    </tu>
    <tu>
      <tuv lang="ZH-CN">
        <seg>SourceEntryToBeTranslated3</seg>
      </tuv>
      <tuv lang="EN-US" changeid="user" changedate="20190119T074550Z">
        <seg>Other Entry 3</seg>
      </tuv>
    </tu>
  </body>
</tmx>

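I can merge the files with xml.etree.ElementTree, but the result depends on the order in which I merge them, so I end up with different versions. Note that the changedate metadata is actually a full date-time stamp, not just a date.

I'm not sure how to walk the element tree while merging so that the "latest" version of each entry is the one that is kept.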
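For illustration, the changedate values in the sample files look like UTC timestamps in the form YYYYMMDDTHHMMSSZ. Assuming that format holds for every entry, a minimal sketch for comparing two of them could look like this (parse_changedate and CHANGEDATE_FORMAT are hypothetical names chosen here, not part of any TMX library):

```python
from datetime import datetime, timezone

# Assumed format, inferred from the sample files: YYYYMMDDTHHMMSSZ (UTC).
CHANGEDATE_FORMAT = "%Y%m%dT%H%M%SZ"


def parse_changedate(value):
    """Convert a changedate such as '20190415T064114Z' to an aware datetime."""
    parsed = datetime.strptime(value, CHANGEDATE_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


older = parse_changedate("20190119T074530Z")  # "I is John." in file1.tmx
newer = parse_changedate("20190415T064114Z")  # "I am John." in file2.tmx
print(newer > older)  # True
```

Because this format is fixed-width and zero-padded, comparing the raw strings would give the same ordering; parsing mainly helps catch malformed values, since strptime raises ValueError on them.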

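from xml.etree import ElementTree

def run(files):
    first = None
    for filename in files:
        # Parse each TMX file and take its root <tmx> element.
        data = ElementTree.parse(filename).getroot()
        if first is None:
            first = data
        else:
            # Appends the later file's <header> and <body> to the first root as-is;
            # nothing is de-duplicated, so the result depends on file order.
            first.extend(data)
    if first is not None:
        # tostring() returns bytes when an encoding such as 'UTF-8' is given.
        print(ElementTree.tostring(first, encoding='UTF-8'))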
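One possible approach, sketched under assumptions rather than offered as a definitive solution: index every <tu> by its source-language segment and keep whichever copy carries the newest changedate. The sketch assumes the structure shown in the samples (a `lang` attribute on each <tuv>, `srclang` on <header>, and fixed-width changedate strings that sort correctly as text). The helper names source_text, latest_changedate and merge_tmx are hypothetical.

```python
from xml.etree import ElementTree


def source_text(tu, srclang):
    """Return the source-language <seg> text of a <tu>, or None if absent."""
    for tuv in tu.findall("tuv"):
        if tuv.get("lang") == srclang:
            return tuv.findtext("seg")
    return None


def latest_changedate(tu):
    """Return the newest changedate among a <tu>'s <tuv> children ('' if none)."""
    dates = (tuv.get("changedate", "") for tuv in tu.findall("tuv"))
    return max(dates, default="")


def merge_tmx(roots):
    """Merge parsed <tmx> roots into the first, keeping the newest <tu> per source."""
    first = roots[0]
    srclang = first.find("header").get("srclang")
    newest = {}  # source text -> <tu>; dicts preserve first-insertion order
    for root in roots:
        for tu in root.find("body").findall("tu"):
            key = source_text(tu, srclang)  # entries without a source share key None
            current = newest.get(key)
            if current is None or latest_changedate(tu) > latest_changedate(current):
                newest[key] = tu
    body = first.find("body")
    for tu in body.findall("tu"):  # findall returns a list, so removal is safe
        body.remove(tu)
    body.extend(newest.values())
    return first


def run(files):
    roots = [ElementTree.parse(name).getroot() for name in files]
    if roots:
        merged = merge_tmx(roots)
        print(ElementTree.tostring(merged, encoding="unicode"))
```

With the two sample files, this keeps "I am John." for the first entry whichever file comes first, because the choice is driven by changedate rather than by merge order. ElementTree does not carry the DOCTYPE line through serialization, so it would have to be written out separately if the output must match the desired file exactly.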

0 answers:

No answers yet