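I'm trying to parse an XML file (more precisely, an XLIFF translation file) and convert it to the (slightly different) TMX format.
My XLIFF source file looks like this: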
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
<header>
<count-group name="SomeFile.strings">
<count count-type="total" unit="word">2</count>
</count-group>
</header>
<body>
<trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
</body>
</file>
<file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
<header>
<count-group name="SomeOtherFile.strings">
<count count-type="total" unit="word">4</count>
</count-group>
</header>
<body>
<trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
</body>
</file>
(and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)
</xliff>
My goal is to rearrange and simplify this a little, so that the content above ends up in the following format:
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>End</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Kraj</seg>
</tuv>
</tu>
<tu>
<prop type="FileSource">SomeOtherFile.strings</prop>
<tuv xml:lang="en">
<seg>return to project list</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is again a note</prop>
<seg>povratak na popis projekata</seg>
</tuv>
</tu>
Note that the original XLIFF file may contain multiple <file origin ...> sections, each containing many <trans-unit ...> elements (the actual strings in that file...).
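I managed to write the part that gets me the "source" and "target" parts, but what I still need is the information from the "file origin" element. That is where the languages are defined (i.e. "source-language" and "target-language", which I would then write as <tuv xml:lang="en"> and <tuv xml:lang="hr"> respectively, for each string), and where the relevant reference to the strings file can be found (i.e. "SomeFile.strings" and "SomeOtherFile.strings", to be used as <prop type="FileSource">SomeFile.strings</prop>).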
Currently I have the following Python code, which extracts the needed "source" and "target" elements just fine:
#!/usr/bin/env python3
#
import sys
from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    exit()

file = sys.argv[1]
tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        print("<TransUnit>")
        print("\t<Source>" + elem.text + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + elem.text + "</Target>")
    elif elem.tag == "note":
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
            print("</TransUnit>")
        else:
            print("</TransUnit>")
    else:
        next
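Now, how can I also extract the "source-language" (i.e. the value "en"), the "target-language" (i.e. the value "hr") and the file reference (i.e. "SomeFile.strings") from the "file origin ..." element in the XLIFF file?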
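In addition, I need to keep (remember) that file reference, i.e. <prop type="FileSource">SomeOtherFile.strings</prop>, for every <tu> unit (there can be many per file, unlike the example above, where each "file" has only one).
For example, I would have: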
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>End</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Kraj</seg>
</tuv>
</tu>
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>Start</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Početak</seg>
</tuv>
</tu>
Every <tu> element then has a <prop type="FileSource"> element showing which file it came from... I would appreciate any help with this.
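Answer 0 (score: 0)
Well, as so often happens, I arrived at a workable solution after some more digging... Perhaps my question was needlessly complicated, when it really came down to identifying the proper root element and correctly addressing (and locating) its children and grandchildren.
Anyway, another Stack Overflow thread put me on the right track, so the solution that now works for me looks like this: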
#!/usr/bin/env python3
#
import sys
import os
from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]
tree = etree.parse(file)
root = tree.getroot()

print("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE tmx SYSTEM \"tmx14.dtd\">\n<tmx version=\"1.4\">")
print("\n<header srclang=\"en\" creationtool=\"XLIFF to TMX\" datatype=\"unknown\" adminlang=\"en\" segtype=\"sentence\" creationtoolversion=\"1.0\">")
print("</header>\n<body>")

# each child of the root is a <file> element carrying the per-file metadata
for element in root:
    FileOrigin = os.path.basename(element.attrib['origin'])
    Product = element.attrib['product']
    Source = element.attrib['source-language']
    Target = element.attrib['target-language']
    # now the children: iter() walks all descendants in document order
    for all_tags in element.iter():
        if all_tags.tag == "source":
            # replacing some troublesome and unnecessary codes
            srctxt = all_tags.text or ''
            srctxt = srctxt.replace('^n', ' ')
            srctxt = srctxt.replace('^b', ' ')
            print("<tu>")
            print("\t<prop type=\"FileSource\">" + FileOrigin + "</prop>")
            print("\t<tuv xml:lang=\"" + Source + "\">")
            print("\t\t<seg>" + srctxt + "</seg>")
            print("\t</tuv>")
        elif all_tags.tag == "target":
            # replacing the same troublesome and unnecessary codes
            targtxt = all_tags.text or ''
            targtxt = targtxt.replace('^n', ' ')
            targtxt = targtxt.replace('^b', ' ')
            print("\t<tuv xml:lang=\"" + Target + "\">")
            print("\t\t<seg>" + targtxt + "</seg>")
        elif all_tags.tag == "note":
            # the <note> is the last child of a <trans-unit>, so it closes
            # the target <tuv> and the whole <tu>
            if all_tags.text is not None:
                print("\t\t<prop type=\"Note\">" + all_tags.text.replace('^n', ' ') + "</prop>")
            print("\t</tuv>")
            print("</tu>")

print("</body>\n</tmx>")
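It could probably be tidied up and given a few bells and whistles, but overall this solves my original problem. Maybe it will help others who are trying to parse XLIFF...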
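One possible way to tidy this up is to build the TMX as an element tree rather than printing tags by hand, so that closing tags and the escaping of characters such as & and < are left to the serializer. The following is only a minimal sketch under a few assumptions: it uses the standard library's xml.etree.ElementTree instead of lxml, the function name xliff_to_tmx and the trimmed-down SAMPLE input are made up for illustration, the XLIFF is assumed to have no XML namespace (as in the question), and the '^n' / '^b' replacements are left out.

```python
import os
import xml.etree.ElementTree as ET

# Clark-notation attribute name that ElementTree serializes as xml:lang
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def xliff_to_tmx(xliff_root):
    """Build a <tmx> element from a parsed, namespace-free <xliff> root.

    Hypothetical helper, not part of lxml or ElementTree.
    """
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes copied from the script above
    ET.SubElement(tmx, "header", srclang="en", creationtool="XLIFF to TMX",
                  datatype="unknown", adminlang="en", segtype="sentence",
                  creationtoolversion="1.0")
    body = ET.SubElement(tmx, "body")
    for file_el in xliff_root.iter("file"):
        # Per-file metadata, reused for every <trans-unit> inside this <file>
        origin = os.path.basename(file_el.get("origin", ""))
        src_lang = file_el.get("source-language", "")
        tgt_lang = file_el.get("target-language", "")
        for unit in file_el.iter("trans-unit"):
            tu = ET.SubElement(body, "tu")
            ET.SubElement(tu, "prop", type="FileSource").text = origin
            src_tuv = ET.SubElement(tu, "tuv", {XML_LANG: src_lang})
            ET.SubElement(src_tuv, "seg").text = unit.findtext("source", default="")
            tgt_tuv = ET.SubElement(tu, "tuv", {XML_LANG: tgt_lang})
            note = unit.findtext("note")
            if note:  # only emit the Note prop when a non-empty note exists
                ET.SubElement(tgt_tuv, "prop", type="Note").text = note
            ET.SubElement(tgt_tuv, "seg").text = unit.findtext("target", default="")
    return tmx


# Trimmed-down input based on the question; the second unit has no <note>
SAMPLE = """<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
<body>
<trans-unit id="8.text"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit>
<trans-unit id="9.text"><source>Start</source><target>Početak</target></trans-unit>
</body>
</file>
</xliff>"""

tmx = xliff_to_tmx(ET.fromstring(SAMPLE))
print(ET.tostring(tmx, encoding="unicode"))
```

Because each <tu> is created per <trans-unit> rather than per <note>, a unit without a note still produces a complete <tu>, and the Note prop ends up before the <seg>, as in the target format shown in the question.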
Answer 1 (score: 0)
# xml.etree.cElementTree was removed in Python 3.9; ElementTree is the same API
import xml.etree.ElementTree as ET

tree = ET.ElementTree(file='inputfile.xlf')
root = tree.getroot()
# each <file> element carries the language pair as attributes
for tag in root.findall('file'):
    s_value = tag.get('source-language')
    t_value = tag.get('target-language')
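The loop above overwrites s_value and t_value on every pass, so only the last <file> survives. If the streaming iterparse approach from the question is preferred, the same attributes can be remembered per file by also listening for "start" events. This is a rough sketch only: iter_units and the SAMPLE data are made-up names for illustration, the standard library's ElementTree is assumed, and the XLIFF is assumed to have no XML namespace.

```python
import io
import os
import xml.etree.ElementTree as ET


def iter_units(source):
    """Yield (file_name, src_lang, tgt_lang, source, target, note) per <trans-unit>.

    Hypothetical helper; `source` is a filename or a binary file object.
    """
    current = ("", "", "")  # metadata of the <file> element being read
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "file":
            # Attributes are already populated when the start tag is seen
            current = (os.path.basename(elem.get("origin", "")),
                       elem.get("source-language", ""),
                       elem.get("target-language", ""))
        elif event == "end" and elem.tag == "trans-unit":
            # On the end event the children are complete
            yield current + (elem.findtext("source", default=""),
                             elem.findtext("target", default=""),
                             elem.findtext("note"))
            elem.clear()  # drop the finished unit's children to save memory


# Two <file> sections, shortened from the question's example
SAMPLE = """<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
<body><trans-unit id="8.text"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit></body>
</file>
<file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr">
<body><trans-unit id="14.accessibilityLabel"><source>return to project list</source><target>povratak na popis projekata</target></trans-unit></body>
</file>
</xliff>""".encode("utf-8")

units = list(iter_units(io.BytesIO(SAMPLE)))
for unit in units:
    print(unit)
```

Each yielded tuple carries the file name and language pair of the <file> it came from, so the caller can write one <tu> per tuple without looking back at the tree.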