Getting SAXParseException: not well-formed (invalid token), and unable to resolve it

Time: 2015-07-13 08:27:58

Tags: xml scrapy sax

I need to parse a very large XML file in Scrapy. A subset of it looks something like this:

<Result> <Node> <browseNodeId>306533011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">temperature-controllers</attribute> </browseNodeAttributes> <browseNodeName>Temperature Controllers</browseNodeName> <browseNodeStoreContextName>Temperature Controllers</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,306533011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Controllers</browsePathByName> <hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>TEMPERATURE_CONTROLLER</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>9931457011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">industrial-and-scientific-temperature-indicators</attribute> </browseNodeAttributes> <browseNodeName>Temperature Indicators</browseNodeName> <browseNodeStoreContextName>Temperature Indicators</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,9931457011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Indicators</browsePathByName> <hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>5006547011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">industrial-temperature-sensors</attribute> </browseNodeAttributes> <browseNodeName>Temperature Probes & Sensors</browseNodeName> <browseNodeStoreContextName>Temperature Probes & Sensors</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,5006547011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Probes & Sensors</browsePathByName> 
<hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>9931455011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">thermal-imagers</attribute> </browseNodeAttributes> <browseNodeName>Thermal Imagers</browseNodeName> <browseNodeStoreContextName>Thermal Imagers</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,9931455011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermal Imagers</browsePathByName> <hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>393280011</browseNodeId> <browseNodeAttributes count="0"/> <browseNodeName>Thermometers</browseNodeName> <browseNodeStoreContextName>Thermometers</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,393280011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers</browsePathByName> <hasChildren>true</hasChildren> <childNodes count="4"> <id>393282011</id> <id>393284011</id> <id>393283011</id> <id>9931459011</id> </childNodes> <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>393282011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">industrial-and-scientific-dial-thermometers</attribute> </browseNodeAttributes> <browseNodeName>Dial Thermometers</browseNodeName> <browseNodeStoreContextName>Dial Thermometers</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,393280011,393282011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & 
Humidity,Thermometers,Dial Thermometers</browsePathByName> <hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>393284011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">science-lab-digital-thermometers</attribute> </browseNodeAttributes> <browseNodeName>Digital Thermometers</browseNodeName> <browseNodeStoreContextName>Lab Digital Thermometers</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,393280011,393284011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Digital Thermometers</browsePathByName> <hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>LAB_SUPPLY</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>393283011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">industrial-and-scientific-glass-thermometers</attribute> </browseNodeAttributes> <browseNodeName>Glass Thermometers</browseNodeName> <browseNodeStoreContextName>Glass Thermometers</browseNodeStoreContextName> <browsePathById>16310091,16310161,256409011,5006566011,393280011,393283011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Glass Thermometers</browsePathByName> <hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> <refinementsInformation count="0"/> </Node> <Node> <browseNodeId>9931459011</browseNodeId> <browseNodeAttributes count="1"> <attribute name="item_type_keyword">infrared-thermometers</attribute> </browseNodeAttributes> <browseNodeName>Infrared Thermometers</browseNodeName> <browseNodeStoreContextName>Infrared Thermometers</browseNodeStoreContextName> 
<browsePathById>16310091,16310161,256409011,5006566011,393280011,9931459011</browsePathById> <browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Thermometers,Infrared Thermometers</browsePathByName> <hasChildren>false</hasChildren> <childNodes count="0"/> <productTypeDefinitions>PRECISION_MEASURING</productTypeDefinitions> <refinementsInformation count="0"/> </Node> </Result>

Parsing it gives me this error:

xml.sax._exceptions.SAXParseException: nodes.xml:11:38: not well-formed (invalid token)

Because the XML file is very large, replacing every & symbol by hand is not an option.

I have not implemented this in Scrapy yet; below is a simple class for reference. How can I troubleshoot this without replacing every & symbol?

import xml.sax


class ABContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        xml.sax.ContentHandler.__init__(self)

    def startElement(self, name, attrs):
        print("startElement '" + name + "'")
        if name == "address":
            print("\tattribute type='" + attrs.getValue("type") + "'")

    def endElement(self, name):
        print("endElement '" + name + "'")

    def characters(self, content):
        print("characters '" + content + "'")

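def main(sourceFileName):
    # Open in binary mode so the parser detects the encoding itself,
    # and close the file once parsing has finished.
    with open(sourceFileName, "rb") as source:
        xml.sax.parse(source, ABContentHandler())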

if __name__ == "__main__":
    main("nodes.xml")

1 answer:

Answer 0 (score: 2)

The error message gives the line and column where the problem is. It is at the & in

<browsePathByName>Industrial & Scientific,Test, Measure & Inspect,Temperature & Humidity,Temperature Controllers</browsePathByName>

The problem is that an & on its own is not valid XML: an & starts an entity reference.

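From the W3C Recommendation, section 2.4 Character Data and Markup:

  The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they MUST be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively. The right angle bracket (>) may be represented using the string "&gt;", and MUST, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.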

The correct fix is to tell the author of the XML that their output is invalid and that they must fix it.
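For reference, escaping on the producing side is usually a small change. The following is only a sketch, assuming the feed were generated in Python (the question does not say how it is produced); it uses the standard library's xml.sax.saxutils.escape, and the variable names are made up for illustration.

```python
from xml.sax.saxutils import escape

# Hypothetical value taken from the sample data in the question.
name = "Temperature Probes & Sensors"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
element = "<browseNodeName>" + escape(name) + "</browseNodeName>"
print(element)
```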

Otherwise you will have to pre-process the text first and replace every standalone & with &amp;.
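One way to do that pre-processing without loading the whole file is to fix each line as it is read and feed it to an incremental SAX parser. This is a rough sketch rather than a definitive solution: the helper names escape_bare_ampersands and parse_with_fixed_ampersands are invented here, the file is assumed to be UTF-8, and the regular expression would also rewrite a bare & inside a comment or CDATA section, which may not be what you want.

```python
import re
import xml.sax

# An "&" that does not already begin an entity or character reference
# such as &amp;, &#38; or &#x26;.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Replace every standalone "&" in text with "&amp;"."""
    return BARE_AMPERSAND.sub("&amp;", text)


def parse_with_fixed_ampersands(path, handler, encoding="utf-8"):
    """Stream the file line by line, repairing ampersands before parsing.

    Assumption: a reference such as "&amp;" never spans a line break,
    so fixing one line at a time is safe.
    """
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    with open(path, encoding=encoding) as source:
        for line in source:
            parser.feed(escape_bare_ampersands(line))
    parser.close()
```

With the handler from the question, this could be called as parse_with_fixed_ampersands("nodes.xml", ABContentHandler()). Only one line is held in memory at a time, so the size of the file should not matter much.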