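I have an XML document containing a CDATA section, and inside it are URL tags whose values contain an & character. I need to read these tags with lxml, but I get an error.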
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "src\lxml\lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src\lxml\lxml.etree.c:79593)
File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:119112)
File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:117670)
File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:111657)
File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105880)
File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:107588)
File "src\lxml\parser.pxi", line 635, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:106442)
File "<string>", line 9
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 9, column 98
How can I get past this error? Am I doing this correctly? What should & be replaced with?
The code is as follows:
from lxml import etree
ns0_NAMESPACE = "http://webservices.online.webapp.paperless.cl"
ns0 = "{%s}" % ns0_NAMESPACE
NSMAP = {'ns0':ns0_NAMESPACE}
response="""
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
<soapenv:Body>
<ns:OnlineGeneration2Response xmlns:ns="http://webservices.online.webapp.cl">
<ns:return>
<![CDATA[<EstadoDoc>
<Estado>Ok<Estado>
<RutEmisor>81201000-K</RutEmisor>
<TipoDte>52</TipoDte>
<FolioM>117620901</FolioM>
<Folio>25022</Folio>
<Glosa>NO INFORMADO</Glosa>
<UrlDte>http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvkPrUZDtY6hMg==</UrlDte>
</EstadoDoc>
<EstadoLote>
<UrlPdf>http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47</UrlPdf>
<UrlCaratula>http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47</UrlCaratula>
</EstadoLote>]]>
</ns:return>
</ns:OnlineGeneration2Response>
</soapenv:Body>
</soapenv:Envelope>"""
root=etree.fromstring(response)
sub_element=root.xpath('//ns0:return',namespaces=NSMAP)
print sub_element.text
if sub_element:
    sub_element=sub_element[0]
    EstadoDoc_root=etree.fromstring(sub_element.text)
Answer 0 (score: 1)
The problem is that the text content of the <ns:return> element (the CDATA section) is not legal XML. If you replace & with &amp; before passing it to etree.fromstring, the parse should succeed.
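The escaping step described above can be sketched as follows. This is a minimal illustration, not a definitive fix: the helper name escape_bare_ampersands and the BARE_AMP pattern are made up for this example, and the fragment is a trimmed, single-rooted piece of the question's data. As posted, the CDATA text also contains an unclosed <Estado> tag and two top-level elements, which escaping alone would not repair. The import falls back to the standard library only so the sketch runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Assumption: stdlib fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Match "&" only when it does not already start an entity or character
# reference such as &amp;, &#38; or &#x26;
BARE_AMP = re.compile(
    r"&(?!(?:[A-Za-z_][A-Za-z0-9_.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Return text with every stray '&' rewritten as '&amp;'."""
    return BARE_AMP.sub("&amp;", text)


# Trimmed, well-formed fragment taken from the question's CDATA content
fragment = (
    "<EstadoLote>"
    "<UrlCaratula>http://G500603svGLH:8080/Facturacion/XMLServlet"
    "?docId=&uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47</UrlCaratula>"
    "</EstadoLote>"
)

root = etree.fromstring(escape_bare_ampersands(fragment))
url = root.find("UrlCaratula").text  # the parser turns &amp; back into &
print(url)
```

Because the lookahead skips references that are already valid, running the helper twice on the same text does not double-escape it.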
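In general, hiding XML inside a CDATA section is not a good idea; this is just one of the problems it can cause. If you have any influence over the party generating this XML, I suggest trying to get them to change it.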
Answer 1 (score: 1)
Use the XML parser's recover option:
parser = etree.XMLParser(recover=True)
EstadoDoc_root = etree.fromstring(sub_element.text, parser=parser)
Then grab the URLs (or change this to whatever you need):
print [x.text for x in EstadoDoc_root.xpath('//UrlCaratula|//UrlPdf')]
['http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47',
'http://G500603svGLH:8080/Facturacion/XMLServlet?docId=']
The second URL is missing the part after the &. ... Is there a way to avoid this?
Use the HTML parser to normalize and handle the offending characters (note the lowercase tag names):
from lxml import html
EstadoDoc_root = html.fromstring(sub_element.text)
print [x.text for x in EstadoDoc_root.xpath('//urlcaratula|//urlpdf')]
['http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47',
'http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47']