XML parsing error: not well-formed (invalid token) in Python

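Asked: 2012-07-23 12:18:44

Tags: python xml parsing sax

Hello, I am trying to scrape XML files. For HTML I used Scrapy; for XML I decided to parse the files with xml.sax.

Below is some sample code (not a real-world example), just to illustrate my question: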

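from xml.sax.handler import ContentHandler
import xml.sax

xmlFilePath = 'users/documents/jobstext.xml'

try:
    # Create a SAX parser; with no handler set, it only checks well-formedness.
    parser = xml.sax.make_parser()
    # Binary mode leaves decoding to the parser, which reads the XML declaration.
    with open(xmlFilePath, 'rb') as xmlFile:
        parser.parse(xmlFile)

except xml.sax.SAXParseException as e:
    # str(e) has the form "<system id>:<line>:<column>: <message>".
    print("*** PARSER error: %s" % e)
    print("%s What is the error actually >>>>" % e)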

Here is the XML:

<?xml version="1.0" encoding="utf-8"?>
<jobs>
  <reader><![CDATA[Identity Group]]></reader>
  <readerUrl><![CDATA[http://www.example.com]]></readerUrl>

  <job>
    <title><![CDATA[Architect - OT]]></title>
    <category><![CDATA[LTC/SNF]]></category>
    <jobId><![CDATA[139693]]></jobId>
    <specialization><![CDATA[LTC/SNF]]></specialization>
    <positionType><![CDATA[Travel]]></positionType>
    <description><![CDATA[<DIV>OT&nbsp;needed for a SNF in&nbsp;Oregon.&nbsp; Oregon is a dramatic land of many changes. From the rugged Oregon seacoast, the high mountain passes of the country for Travel Allied Professionals and Travel Nurses. Our clients are among the most prestigious healthcare facilities in the country.</DIV>
<DIV>&nbsp;</DIV>
 </description>
<P style="MARGIN: 0in 0in 0pt" class=MsoNormal><FONT size=3><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"><SPAN style="mso-spacerun: yes">&nbsp; </SPAN>Position will manage 24 ED Rooms with 24/7 accountability<o:p></o:p></FONT></SPAN></FONT></P>
<P style="MARGIN: 0in 0in 0pt" class=MsoNormal><FONT size=3><SPAN style="FONT-FAMILY: Symbol; COLOR: black; mso-ascii-font-family: 'Times New Roman'">�</SPAN><SPAN style="COLOR: black"><FONT face="Times New Roman"> <SPAN style="mso-spacerun: yes">&nbsp;</SPAN>55 FTEs <o:p></o:p></FONT></SPAN></FONT></P>
  </job>
</jobs>

Result:

*** PARSER error: users/documents/jobstext.xml:13:150: not well-formed (invalid token)
users/documents/jobstext.xml:13:150: not well-formed (invalid token) What is the error actually >>>>

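What happens when execution reaches the <P> tag, where the parser reports an invalid token at column 150? As you can see in the error above, I suspect the ? character at that position.

So can anyone tell me how to resolve the not well-formed (invalid token) error when parsing XML?

Sorry if I have explained this in the wrong format, but I hope I have explained the idea well.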
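One way to test the suspicion about the ? character is sketched below for Python 3. The helpers first_error and byte_at_error and the sample bytes are hypothetical, and the sketch assumes the ? stands for a byte that is not valid UTF-8 (here 0xB7, a middle dot in Windows-1252). The post does not confirm that the real file contains that byte.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def first_error(xml_bytes):
    """Return the SAXParseException raised for xml_bytes, or None if it parses."""
    try:
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc
    return None


def byte_at_error(xml_bytes, exc):
    """Return the byte at the reported position (1-based line, 0-based column).

    Assumes everything before the error on that line is ASCII, so the
    column can be used directly as a byte offset.
    """
    line = xml_bytes.split(b"\n")[exc.getLineNumber() - 1]
    col = exc.getColumnNumber()
    return line[col:col + 1]


# Hypothetical sample: declared as UTF-8 but containing a lone 0xB7 byte.
bad = b'<?xml version="1.0" encoding="utf-8"?>\n<job><SPAN>\xb7</SPAN> 55 FTEs</job>'

err = first_error(bad)
offending = byte_at_error(bad, err) if err is not None else None

# Assumption: the data was really Windows-1252, so re-encode it as UTF-8.
repaired = bad.decode("cp1252").encode("utf-8")
repaired_error = first_error(repaired)

print("error:", err)
print("offending byte:", offending)
print("repaired parses cleanly:", repaired_error is None)
```

If the byte printed is outside the ASCII range, the file is probably not UTF-8 despite its declaration. Re-encoding it, or declaring its true encoding, is one option.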

Edited code:

<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial">THE MOST COMPETITIVE RATES IN NM .....<o:p></o:p></SPAN></P>
<P class=MsoNormal style="MARGIN: 0in 0in 0pt"><SPAN style="FONT-SIZE: 10pt; COLOR: black; FONT-FAMILY: Arial">Busy <?xml:namespace prefix = st1 ns = "urn:schemas-microsoft-com:office:smarttags" /><st1:place w:st="on"><st1:PlaceName w:st="on">Acute</st1:PlaceName> <st1:PlaceName w:st="on">Care</st1:PlaceName> <st1:PlaceType w:st="on">Hospital</st1:PlaceType></st1:place> needs Occupational Therapists.&nbsp; Experience with </SPAN><SPAN style="FONT-SIZE: 10pt; FONT-FAMILY: Arial">Ortho, Neuro, vestibular balance, aquatic a plus!<SPAN style="COLOR: black">&nbsp; New grads welcome.<SPAN style="mso-spacerun: yes">&nbsp; </SPAN>Signon Bonus and help with relocation.<SPAN style="mso-spacerun: yes">&nbsp; </SPAN>For more details please call or email Carole 800 995 2673 X1329 or <A href="mailto:cs@coremedicalgroup.com"><SPAN style="mso-bidi-font-weight: bold; mso-bidi-font-size: 12.0pt">cs@coremedicalgroup.com</SPAN></A><o:p></o:p></SPAN></SPAN></P>

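2 answers:

Answer 0 (score: 1)

Your description has no closing tag, and the CDATA section inside it is never terminated... though I would expect the parser to fail at the end of the document rather than on the third line of data in that element.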
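The effect of the missing ]]> can be seen with a small sketch (Python 3). The TextCollector class, the parse_text helper, and the two sample strings are hypothetical and only loosely modeled on the question's XML.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class TextCollector(ContentHandler):
    """Collects all character data reported by the parser."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_text(xml_bytes):
    """Return the concatenated character data; raises SAXParseException on error."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


# CDATA closed with ]]>: the HTML inside is passed through as plain text.
terminated = (b"<job><description><![CDATA[<DIV class=x>OT&nbsp;needed</DIV>]]>"
              b"</description></job>")

# CDATA never closed: </description> and </job> become part of the CDATA text.
unterminated = (b"<job><description><![CDATA[<DIV class=x>OT&nbsp;needed</DIV>"
                b"</description></job>")

text = parse_text(terminated)

try:
    parse_text(unterminated)
    cdata_error = None
except xml.sax.SAXParseException as exc:
    cdata_error = exc

print(text)
print(cdata_error)
```

Inside a terminated CDATA section, unquoted attributes and entities such as &nbsp; cause no problem because the parser never interprets them as markup.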

Answer 1 (score: 1)

Since the question has changed...

XML attribute values must be quoted.

For example, class=MsoNormal should be class="MsoNormal".
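A minimal sketch of the difference, for Python 3. The parse_error helper and the two one-line documents are hypothetical and stand in for the markup in the question.

```python
import xml.sax
from xml.sax.handler import ContentHandler


def parse_error(xml_bytes):
    """Return the SAXParseException raised for xml_bytes, or None if it parses."""
    try:
        xml.sax.parseString(xml_bytes, ContentHandler())
    except xml.sax.SAXParseException as exc:
        return exc
    return None


# Same element, once with an unquoted and once with a quoted attribute value.
unquoted = b'<job><P class=MsoNormal>55 FTEs</P></job>'
quoted = b'<job><P class="MsoNormal">55 FTEs</P></job>'

unquoted_error = parse_error(unquoted)
quoted_error = parse_error(quoted)

if unquoted_error is not None:
    # getLineNumber() is 1-based; getColumnNumber() is 0-based.
    print("unquoted: line %d, column %d: %s" % (
        unquoted_error.getLineNumber(),
        unquoted_error.getColumnNumber(),
        unquoted_error.getMessage(),
    ))
print("quoted parses cleanly:", quoted_error is None)
```

Markup like the edited code in the question is HTML rather than XML, so it may be simpler to keep it inside a properly terminated CDATA section than to quote every attribute by hand.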