Parsing in Python

Asked: 2015-06-16 11:25:18

Tags: python xml parsing elementtree

I have a problem with parsing in Python. I have this kind of XML file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE Trans SYSTEM "trans-14.dtd">
<Trans scribe="maria" audio_filename="agora_2007_11_05_a" version="11" version_date="080826" xml:lang="catalan">
<Topics>
<Topic id="to1" desc="music"/>
<Topic id="to2" desc="bgnoise"/>
<Topic id="to4" desc="silence"/>
<Topic id="to5" desc="speech"/>
<Topic id="to6" desc="speech+music"/>
</Topics>
<Speakers>
<Speaker id="spk1" name="Xavi Coral" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk2" name="Ferran Martínez" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
<Speaker id="spk3" name="Jordi Barbeta" check="no" type="male" dialect="native" accent="catalan" scope="local"/>
</Speakers>
<Section type="report" topic="to6" startTime="111.286" endTime="119.308">
<Turn speaker="spk1" startTime="111.286" endTime="119.308" mode="planned" channel="studio">
<Sync time="111.286"/>
ha estat director del diari La Vanguàrdia,
<Sync time="113.56"/>
ha estat director general de Barcelona Televisió i director del Centre Territorial de Televisió Espanyola a Catalunya,
<Sync time="119.308"/>
actualment col·labora en el diari 
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>
</Section>

This is my Python code:

    import xml.etree.ElementTree as etree
    import os
    import sys

    xmlD = etree.parse(sys.stdin)
    root = xmlD.getroot()
    sections = root.getchildren()[2].getchildren()
    for section in sections:
        turns = section.getchildren()
        for turn in turns:
            speaker = turn.get('speaker')
            mode = turn.get('mode')
            childs = turn.getchildren()
            for child in childs:
                time = child.get('time')
                opt = child.get('desc')
                extent = child.get('extent')
                # Map the language code of an opening Event to an output tag
                if opt == 'es' and extent == 'begin':
                    opt = "ESP:"
                elif opt == "la" and extent == 'begin':
                    opt = "LAT:"
                elif opt == "en" and extent == 'begin':
                    opt = "ENG:"
                else:
                    opt = ""
                if time:
                    time = time
                else:
                    time = ""
                # The tag is prepended only once, to the start of the tail text
                print time, opt+child.tail.encode('latin-1')

I need to mark the words pronounced in another language with a tag such as LANG:, for example: spanish words ENG:hello, spanish words. However, I don't know how to handle two consecutive words pronounced in another language: spanish words ENG:hello ENG:man, spanish words. The language change is indicated by the Event XML tag.

Currently my output is:

actualment col·labora en el diari ESP:El Periódico de Catalunya.

I need:

actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.

Can anyone help me?

Thanks!

1 Answer:

Answer 0: (score: 1)

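You can do something like this in place of your print statement, so that the tag is prepended to every space-separated word of the tail rather than only the first one:

    print time, opt+(" " + opt).join([c.encode('latin-1').decode('latin-1') for c in child.tail.split(' ')])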
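For readers on Python 3 (the snippets above use the Python 2 print statement), the same per-word tagging idea might be sketched roughly as follows. This is only an illustration under a few assumptions: the names LANG_TAGS, tag_words and render_turn are hypothetical helpers made up for this example, the SAMPLE string is a trimmed copy of the Turn element from the question, and the sketch tracks both the begin and the end Event so the tag is dropped once the foreign-language span closes.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from Event "desc" codes to output tags
# (the codes and tags are the ones used in the question's code).
LANG_TAGS = {"es": "ESP:", "la": "LAT:", "en": "ENG:"}


def tag_words(text, tag):
    """Prefix every whitespace-separated word in text with tag."""
    return " ".join(tag + word for word in text.split())


def render_turn(turn):
    """Return the Turn's text with each foreign-language word tagged.

    Only the text that follows each child element (its tail) is used;
    text placed before the first child is ignored in this sketch.
    """
    pieces = []
    current_tag = ""  # tag of the language span we are currently inside
    for child in turn:
        if child.tag == "Event" and child.get("type") == "language":
            if child.get("extent") == "begin":
                current_tag = LANG_TAGS.get(child.get("desc"), "")
            elif child.get("extent") == "end":
                current_tag = ""
        tail = (child.tail or "").strip()
        if tail:
            pieces.append(tag_words(tail, current_tag))
    return " ".join(pieces)


# Trimmed copy of the Turn element from the question.
SAMPLE = """<Turn speaker="spk1">
<Sync time="119.308"/>
actualment col·labora en el diari
<Event desc="es" type="language" extent="begin"/>
El Periódico
<Event desc="es" type="language" extent="end"/>
de Catalunya.
</Turn>"""

print(render_turn(ET.fromstring(SAMPLE)))
# prints: actualment col·labora en el diari ESP:El ESP:Periódico de Catalunya.
```

Iterating over the element directly replaces getchildren(), which newer Python 3 releases no longer provide. Keeping the current tag in a variable also means several tail fragments inside one begin/end pair would all be tagged, which the one-line join alone does not cover.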