Question

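I am trying to parse an XML file (more precisely, an XLIFF translation file) and convert it to the (slightly different) TMX format.

My XLIFF source file looks like this: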

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
    <header>
      <count-group name="SomeFile.strings">
        <count count-type="total" unit="word">2</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
    </body>
  </file>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
    <header>
      <count-group name="SomeOtherFile.strings">
        <count count-type="total" unit="word">4</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
    </body>
  </file>

  (and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)

  </xliff>

My goal is to rearrange and simplify this a little, so that the content above ends up in the following format:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeOtherFile.strings</prop>
    <tuv xml:lang="en">
        <seg>return to project list</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is again a note</prop>
        <seg>povratak na popis projekata</seg>
    </tuv>
</tu>

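Note that the original XLIFF file can contain several <file origin ...> sections, each containing many <trans-unit ...> elements (the actual strings from that file...)

I have managed to write the part that gets me the "source" and "target" elements, but what I still need is the information from the "file origin" element. That is where the languages are defined (i.e. "source-language" and "target-language", which I then write out as <tuv xml:lang="en"> and <tuv xml:lang="hr"> respectively, for each string), and where the reference to the strings file can be found (i.e. "SomeFile.strings" and "SomeOtherFile.strings", to be used as <prop type="FileSource">SomeFile.strings</prop>).

Currently I have the following Python code, which extracts the required "source" and "target" elements nicely: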

#!/usr/bin/env python3
#

import sys

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    exit()

file = sys.argv[1]

tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        print("<TransUnit>")
        print("\t<Source>" + elem.text  + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + elem.text + "</Target>")
    elif elem.tag == "note":
        # the note is the last child, so it closes the unit either way
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
        print("</TransUnit>")

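Now, how can I also extract the "source-language" (i.e. the value "en"), the "target-language" (i.e. the value "hr") and the file reference (i.e. "SomeFile.strings") from the "file origin ..." element in the XLIFF file?

In addition, I need to keep (remember) that file reference, i.e.:

<prop type="FileSource">SomeOtherFile.strings</prop>

for all the translation units (<tu>) that belong to that file (there can be many of them, unlike in the example above, where each "file" contains only one).

So, for example, I would have: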

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>Start</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Početak</seg>
    </tuv>
</tu>

Each <tu> element has a <prop type="FileSource"> element showing which file it comes from...

I would greatly appreciate any help with this.

Answer 1

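Well, as so often happens, I arrived at a usable solution after some more digging... Perhaps my question was unnecessarily complicated, when it really came down to determining the appropriate root element and correctly addressing (and locating) its children and descendants.

Anyway, another Stack Overflow thread put me on the right track, so the solution that works for me now looks like this: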

#!/usr/bin/env python3
#

import sys
import os
from xml.sax.saxutils import escape

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]

tree = etree.parse(file)
root = tree.getroot()


def clean(text, codes=('^n', '^b')):
    # replacing some troublesome and unnecessary codes, then escaping &, < and >
    text = text or ''
    for code in codes:
        text = text.replace(code, ' ')
    return escape(text)


print("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE tmx SYSTEM \"tmx14.dtd\">\n<tmx version=\"1.4\">")
print("\n<header srclang=\"en\" creationtool=\"XLIFF to TMX\" datatype=\"unknown\" adminlang=\"en\" segtype=\"sentence\" creationtoolversion=\"1.0\">")
print("</header>\n<body>")

# every <file> element carries the origin and the language attributes
for element in root.findall('file'):
    FileOrigin = os.path.basename(element.attrib['origin'])
    Source = element.attrib['source-language']
    Target = element.attrib['target-language']
    # now the <trans-unit> descendants; each one becomes a complete <tu>
    for unit in element.iter('trans-unit'):
        note = unit.findtext('note')
        print("<tu>")
        print("\t<prop type=\"FileSource\">" + escape(FileOrigin) + "</prop>")
        print("\t<tuv xml:lang=\"" + Source + "\">")
        print("\t\t<seg>" + clean(unit.findtext('source')) + "</seg>")
        print("\t</tuv>")
        print("\t<tuv xml:lang=\"" + Target + "\">")
        if note:
            print("\t\t<prop type=\"Note\">" + clean(note, codes=('^n',)) + "</prop>")
        print("\t\t<seg>" + clean(unit.findtext('target')) + "</seg>")
        print("\t</tuv>")
        print("</tu>")
print("</body>\n</tmx>")

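I may tidy it up a bit and add some bells and whistles, but in general this solves my original problem. Maybe it will help someone else who is trying to parse XLIFF...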
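
For anyone who would rather stay with the streaming iterparse approach from the question (for example with very large XLIFF files), here is a minimal sketch of how the per-file data could be remembered while iterating. It uses the standard library's xml.etree.ElementTree instead of lxml so that it runs without extra packages. The function name iter_units, the dictionary keys and the SAMPLE string are made up for illustration, and it assumes an XLIFF 1.0 file without XML namespaces, like the one in the question.

```python
import io
import os
import xml.etree.ElementTree as ET


def iter_units(xliff):
    """Yield one dict per <trans-unit>, tagged with its enclosing <file> data."""
    current = {}
    for event, elem in ET.iterparse(xliff, events=("start", "end")):
        if event == "start" and elem.tag == "file":
            # Attributes are already available on the "start" event,
            # so the file reference and languages can be remembered here.
            current = {
                "file": os.path.basename(elem.get("origin", "")),
                "src": elem.get("source-language"),
                "tgt": elem.get("target-language"),
            }
        elif event == "end" and elem.tag == "trans-unit":
            # On the "end" event the children are fully parsed.
            yield dict(
                current,
                source=elem.findtext("source"),
                target=elem.findtext("target"),
                note=elem.findtext("note"),  # None when there is no <note>
            )
            elem.clear()  # release the processed subtree


# Hypothetical, shortened sample based on the file in the question.
SAMPLE = """<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
    <body>
      <trans-unit id="8.text"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit>
      <trans-unit id="9.text"><source>Start</source><target>Početak</target></trans-unit>
    </body>
  </file>
</xliff>"""

for unit in iter_units(io.BytesIO(SAMPLE.encode("utf-8"))):
    print(unit["file"], unit["src"], unit["tgt"], unit["source"], unit["target"])
```

Every yielded dictionary carries the file name and both languages, so each <tu> can be written out with its own <prop type="FileSource"> no matter how many trans-units a file contains.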

Answer 2

# xml.etree.cElementTree was removed in Python 3.9; ElementTree replaces it
import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='inputfile.xlf')

root = tree.getroot()

# each <file> element carries its own language attributes
for tag in root.findall('file'):
    s_value = tag.get('source-language')
    t_value = tag.get('target-language')
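
To keep these values together with the file they belong to, one could collect them per <file> element instead of reusing one pair of variables. The following is only a sketch: the helper name file_languages and the shortened SAMPLE string are invented for illustration, and it again assumes an XLIFF file without namespaces.

```python
import os
import xml.etree.ElementTree as ET


def file_languages(root):
    """Return (origin basename, source-language, target-language) per <file>."""
    return [
        (
            os.path.basename(f.get("origin", "")),
            f.get("source-language"),
            f.get("target-language"),
        )
        for f in root.findall("file")
    ]


# Hypothetical, shortened sample based on the file in the question.
SAMPLE = """<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr"/>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr"/>
</xliff>"""

root = ET.fromstring(SAMPLE)
for name, src, tgt in file_languages(root):
    print(name, src, tgt)
```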

Python: parsing XML (XLIFF) files containing headers
