I'm trying to deserialize XML into objects, and I've run into a problem with the encoding of various items in the XML tree.
Example XML:
<?xml version="1.0" encoding="utf-8"?>
<results>
<FlightTravel>
<QuantityOfPassengers>6</QuantityOfPassengers>
<Id>N5GWXM</Id>
<InsuranceId>330992</InsuranceId>
<TotalTime>3h 00m</TotalTime>
<TransactionPrice>540.00</TransactionPrice>
<AdditionalPrice>0</AdditionalPrice>
<InsurancePrice>226.56</InsurancePrice>
<TotalPrice>9561.31</TotalPrice>
<CompanyName>XXXXX</CompanyName>
<TaxID>111-11-11-111</TaxID>
<InvoiceStreet>Jagiellońska</InvoiceStreet>
<InvoiceHouseNo>8</InvoiceHouseNo>
<InvoiceZipCode>Jagiellońska</InvoiceZipCode>
<InvoiceCityName>Warszawa</InvoiceCityName>
<PayerStreet>Jagiellońska</PayerStreet>
<PayerHouseNo>8</PayerHouseNo>
<PayerZipCode>11-111</PayerZipCode>
<PayerCityName>Warszawa</PayerCityName>
<PayerEmail>no-reply@xxxx.pl</PayerEmail>
<PayerPhone>123123123</PayerPhone>
<Segments>
<Segment0>
<DepartureAirport>WAW</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>07:50</DepartureTime>
<ArrivalAirport>VIE</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>09:15</ArrivalTime>
</Segment0>
<Segment1>
<DepartureAirport>VIE</DepartureAirport>
<DepartureDate>śr. 06 lip</DepartureDate>
<DepartureTime>10:00</DepartureTime>
<ArrivalAirport>SZG</ArrivalAirport>
<ArrivalDate>śr. 06 lip</ArrivalDate>
<ArrivalTime>10:50</ArrivalTime>
</Segment1>
</Segments>
</FlightTravel>
</results>
The XML deserialization code in Python:
# -*- coding: utf-8 -*-
from lxml import etree
import codecs
class TitleTarget(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True #if tag == 'Title' else False
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text.append(data)
    def close(self):
        return self.text
parser = etree.XMLParser(target = TitleTarget())
infile = 'Flights.xml'
results = etree.parse(infile, parser)
out = open('wynik.txt', 'w')
out.write('\n'.join(results))
out.close()
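Output:
['6', 'N5GWXM', '330992', '3h 00m', '540.00', '0', '226.56', '9561.31', 'XXXXX', '111-11-11-111', 'Jagiello', 'ń', 'ska', '8', 'Jagiello', 'ń', 'ska', 'Warszawa', 'Jagiello', 'ń', 'ska', '8', '11-111', 'Warszawa', 'no-reply@xxxx.pl', '123123123', 'WAW', 'ś', 'r. 06 lip', '07:50', 'VIE', 'ś', 'r. 06 lip', '09:15', 'VIE', 'ś', 'r. 06 lip', '10:00', 'SZG', 'ś', 'r. 06 lip', '10:50']
The item 'Jagiellońska' contains the special character 'ń'. When the parser appends the data to the array, the character 'ń' acts as a kind of split point: the text before it, the character itself, and the text after it end up as three separate entries. My question is: why does this happen? The remaining items are appended to the array correctly. Exactly the same thing happens with the item 'śr. 06 lip'.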
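Answer 0 (score: 1)
The problem is that the data method of the target class may be called more than once per element. This can happen, for example, when the feeder crosses a block boundary. It looks like it also happens when a non-ASCII character is encountered. This is ancient lore; I can't find where it is documented. However, if you change your target class to something like the following, it will work. I have tested it against your data.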
class TitleTarget(object):
    def __init__(self):
        self.text = []
    def start(self, tag, attrib):
        self.is_title = True #if tag == 'Title' else False
        if self.is_title:
            self.text.append(u'')
    def end(self, tag):
        pass
    def data(self, data):
        if self.is_title:
            self.text[-1] += data
    def close(self):
        return self.text
To get a better look at the output, call print repr(results) after parsing. You should now see unsplit text fragments like these:
u'Jagiello\u0144ska\n '
u'\u015br. 06 lip\n '
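For anyone who wants to try the same idea without lxml, here is a minimal Python 3 sketch that uses the standard library's xml.etree.ElementTree.XMLParser instead. That is a swapped-in parser, not the one from the question, though it accepts a target object with the same start/data/end/close methods. The names TextCollector and collect_texts and the two-element sample document are made up for illustration. Rather than appending to the last list entry, it buffers the chunks and joins them in end(), which also drops the whitespace-only text between elements.

```python
import xml.etree.ElementTree as ET


class TextCollector:
    """Parser target (hypothetical name) that yields one string per element."""

    def __init__(self):
        self.texts = []
        self._chunks = []

    def start(self, tag, attrib):
        # A new element begins: drop any text buffered for the parent.
        self._chunks = []

    def data(self, data):
        # May be called several times for a single text node,
        # so only buffer the chunk here.
        self._chunks.append(data)

    def end(self, tag):
        # The element is complete: join the chunks into one string
        # and skip whitespace-only content.
        text = ''.join(self._chunks).strip()
        if text:
            self.texts.append(text)
        self._chunks = []

    def close(self):
        return self.texts


def collect_texts(xml_bytes):
    """Parse UTF-8 encoded XML bytes and return the collected texts."""
    parser = ET.XMLParser(target=TextCollector())
    parser.feed(xml_bytes)
    # XMLParser.close() returns whatever the target's close() returns.
    return parser.close()


# Illustrative sample, cut down from the XML in the question.
sample = ('<results><InvoiceStreet>Jagiellońska</InvoiceStreet>'
          '<DepartureDate>śr. 06 lip</DepartureDate></results>').encode('utf-8')
texts = collect_texts(sample)
print(texts)
```

Joining in end() means the result no longer depends on how many times the parser chooses to call data() for one element.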