Python:解析包含标头的XML(xliff)文件

时间:2019-05-10 15:45:13

标签: python lxml xliff

我正在尝试解析XML文件(更精确地说,它是XLIFF转换文件),并将其转换为(略有不同)TMX格式。

我的XLIFF源文件如下:

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
    <header>
      <count-group name="SomeFile.strings">
        <count count-type="total" unit="word">2</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
    </body>
  </file>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
    <header>
      <count-group name="SomeOtherFile.strings">
        <count count-type="total" unit="word">4</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
    </body>
  </file>

  (and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)

  </xliff>

我的目标是稍微重新排列和简化这些内容,以使上面的内容变成以下格式:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeOtherFile.strings</prop>
    <tuv xml:lang="en">
        <seg>return to project list</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is again a note</prop></prop>
        <seg>povratak na popis projekata</seg>
    </tuv>
</tu>

请注意,原始XLIFF文件可能包含多个<file origin ...>部分,每个部分包含许多<trans-unit ...>元素(是该文件中的实际字符串...)

我设法编写了可以让我“源”和“目标”部分确定的部分,但是我仍然需要的是来自“文件来源”元素的部分……其中定义了语言(即“源语言”和“目标语言”,然后我将它们分别写为<tuv xml:lang="en"><tuv xml:lang="hr">(对于每个字符串),并在其中可以找到对字符串文件的相关引用(即“ SomeFile。字符串”和“ SomeOtherFile.strings”(用作<prop type="FileSource">SomeFile.strings</prop>)。

当前,我有以下Python代码,可以很好地提取所需的“源”和“目标”元素:

#!/usr/bin/env python3
#

import sys

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    exit()

file = sys.argv[1]

tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        print("<TransUnit>")
        print("\t<Source>" + elem.text  + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + elem.text + "</Target>")
    elif elem.tag == "note":
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
            print("</TransUnit>")
        else: 
            print("</TransUnit>")
    else:
        next

现在,我又如何从中提取“源语言”(即值“ en”),“目标语言”(即值“ hr”)和文件引用(即“ SomeFile.strings”)? XLIFF文件中的“文件来源....”元素?

此外,我需要保持(记住)该文件引用,即:

<prop type="FileSource">SomeOtherFile.strings</prop>
  • 用于属于该文件的所有个翻译(<tu>)单元(可以有很多,与上面的示例不同,其中每个“文件”只有一个

例如,我会:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>Start</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Početak</seg>
    </tuv>
</tu>
  • 每个<tu>元素都有一个<prop type="FileSource">元素,显示它来自哪个文件...

在此方面,我将不胜感激。

2 个答案:

答案 0 :(得分:0)

嘿,这是经常发生的事,我经过进一步的挖掘后才得出可用的解决方案... 也许我的问题不必要地复杂,而实际上却是确定适当的根元素,以及对子孙的正确寻址(和定位)。

无论如何,另一个stackoverflow线程使我走上了正确的道路,所以现在适合我的解决方案如下所示:

#!/usr/bin/env python3
#

import sys
import os

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    exit()

file = sys.argv[1]

tree = etree.parse(file)
root = tree.getroot()

print("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE tmx SYSTEM \"tmx14.dtd\">\n<tmx version=\"1.4\">")
print("\n<header srclang=\"en\" creationtool=\"XLIFF to TMX\" datatype=\"unknown\" adminlang=\"en\" segtype=\"sentence\" creationtoolversion=\"1.0\">")
print("</header>\n<body>")

for element in root:
    FileOrigin = (os.path.basename(element.attrib['origin']))
    Product = element.attrib['product']
    Source = element.attrib['source-language']
    Target =  element.attrib['target-language']
    # now the children
    for all_tags in element.findall('.//'):
        if all_tags.tag == "source":
            # replacing some troublesome and unnecessary codes
            srctxt = all_tags.text
            srctxt = srctxt.replace('^n', ' ')
            srctxt = srctxt.replace('^b', ' ')
            print("<tu>")
            print("\t<prop type=\"FileSource\">" + FileOrigin + "</prop>")
            print("\t<tuv xml:lang=\"" + Source + "\">")
            print("\t\t<seg>" + srctxt + "</seg>")
        elif all_tags.tag == "target":
            # replacing the same troublesome and unnecessary codes
            targtxt = all_tags.text
            targtxt = targtxt.replace('^n', ' ')
            targtxt = targtxt.replace('^b', ' ')
            print("\t<tuv xml:lang=\"" + Target + "\">")
            print("\t\t<seg>" + targtxt + "</seg>")
        elif all_tags.tag == "note":
            if all_tags.text is not None:
                print("\t\t<prop type=\"Note\">" + all_tags.text.replace('^n', ' ') + "</prop>")
                print("</tu>")
            else: 
                print("</tu>")
        else:
            next
print("</body>\n</tmx>")

可能会整理一下并添加一些铃铛,但是总的来说,这解决了我原来的问题。也许它可以帮助其他尝试执行xliff解析的人...

答案 1 :(得分:0)

import xml.etree.cElementTree as ET

tree=ET.ElementTree(file='inputfile.xlf')

root=tree.getroot()

for tag in root.findall('file'):
    t_value = tag.get('target-language')

for tag in root.findall('file'):
    s_value = tag.get('source-language')