Question

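I am using a SAX parser in my application to parse XML that is passed in as a string. When my code sends the body of an HTML page as the string instead, the SAX parser gets stuck for a very long time (more than 5 hours).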

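The URL I want to parse is "http://www.cityam.com/taxonomy/term/1/all/feed". This URL serves an HTML page instead of XML. How can I handle this kind of problem, or how can I make my SAX parser exit with a proper exception? My code is here: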

public List<RssEntry> parseDocument(String body) {
    // The body is expected to be XML, but parsing gets stuck when it is an HTML page.
    SAXParserFactory factory = SAXParserFactory.newInstance();
    try {
        SAXParser parser = factory.newSAXParser();
        XMLReader reader = parser.getXMLReader();
        parser.parse(new ByteArrayInputStream(body.getBytes("UTF-8")), this);
    } catch (ParserConfigurationException | SAXException | IOException e) {
        // some catch block (error handling elided in the post)
    }
    // ... (rest of the method elided in the post)
}

Please help me. Thanks.

Answer 1

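The SAX parser gets stuck for a very long time (more than 5 hours) when my code sends the body of an HTML page as the string. When that HTML body declares an external DTD at the start of the page, the SAX parser is busy loading the external DTD. This behavior is controlled by the feature "http://apache.org/xml/features/nonvalidating/load-external-dtd".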

So I set this feature to false. With it disabled, the SAX parser throws an error if the XML is not well-formed.

XMLReader reader = parser.getXMLReader();
reader.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
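
For context, here is a minimal, self-contained sketch of that fix. The class name `NoExternalDtdDemo`, the helper `parseError`, and the sample inputs (including the XHTML DOCTYPE) are assumptions made for illustration and are not part of the original code. It also assumes a parser that recognizes this Xerces feature, such as the one bundled with the JDK.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

class NoExternalDtdDemo {
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Returns null if body is well-formed XML, otherwise the parser's error message. */
    static String parseError(String body) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            XMLReader reader = parser.getXMLReader();
            // Do not fetch the DTD named in a DOCTYPE declaration.
            reader.setFeature(LOAD_EXTERNAL_DTD, false);
            // DefaultHandler ignores content events and rethrows fatal errors.
            DefaultHandler handler = new DefaultHandler();
            reader.setContentHandler(handler);
            reader.setErrorHandler(handler);
            reader.parse(new InputSource(new StringReader(body)));
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            return String.valueOf(e.getMessage());
        }
    }

    public static void main(String[] args) {
        // Hypothetical HTML input: external DOCTYPE plus an unclosed <br> tag.
        String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html><body><p>text<br></p></body></html>";
        System.out.println("HTML: " + parseError(html));
        System.out.println("XML:  " + parseError("<rss version=\"2.0\"><channel/></rss>"));
    }
}
```

With the feature disabled, malformed HTML should fail quickly with a parse error instead of waiting on a DTD download, while a well-formed feed still parses.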

Thanks to everyone for helping me.

Answer 2

// expected body is xml but getting stuck when get body of html page.
SAXParserFactory factory = SAXParserFactory.newInstance();
if(!body.startsWith("<?xml")){
    throw new NotXmlInputException(message); //your exception
}
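
A slightly fuller sketch of this pre-check might look like the following. `NotXmlInputException` is defined here as a hypothetical unchecked exception, since the answer leaves it to the reader, and the names `XmlPrecheck`, `startsWithXmlDeclaration`, and `requireXml` are likewise assumptions. Note that the XML declaration is optional, so this heuristic also rejects valid feeds that omit it.

```java
class XmlPrecheck {
    /** Hypothetical exception type; the answer does not define it. */
    static class NotXmlInputException extends RuntimeException {
        NotXmlInputException(String message) {
            super(message);
        }
    }

    /** Heuristic: true only if the text begins with an XML declaration. */
    static boolean startsWithXmlDeclaration(String body) {
        if (body == null) {
            return false;
        }
        // A byte order mark decodes to U+FEFF and may precede the declaration.
        String text = body.startsWith("\uFEFF") ? body.substring(1) : body;
        return text.startsWith("<?xml");
    }

    /** Throws before any parsing is attempted if the input does not look like XML. */
    static void requireXml(String body) {
        if (!startsWithXmlDeclaration(body)) {
            throw new NotXmlInputException("Input does not start with an XML declaration");
        }
    }

    public static void main(String[] args) {
        System.out.println(startsWithXmlDeclaration("<?xml version=\"1.0\"?><rss/>")); // true
        System.out.println(startsWithXmlDeclaration("<!DOCTYPE html><html></html>")); // false
    }
}
```

Calling `requireXml(body)` at the top of `parseDocument` would reject the HTML page before the parser is even created.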

Or create a schema file for your XML and use validation:

SchemaFactory constraintFactory =
        SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Source constraints = new StreamSource(/* your schema */);
Schema schema = constraintFactory.newSchema(constraints);
Validator validator = schema.newValidator();

try {
    validator.validate(/* convert your string to a Source */);
} catch (org.xml.sax.SAXException e) {
    log("Validation error: " + e.getMessage());
}
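
To make the two placeholders concrete, here is a hedged sketch that validates a string against an inline schema. The schema text, the class name `FeedSchemaCheck`, and the helper `isValidFeed` are assumptions for illustration only. The schema just requires an `rss` root element and is not a full RSS schema. Blocking external DTD access on the validator is an extra precaution that is not in the answer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

class FeedSchemaCheck {
    // Hypothetical minimal schema: the root must be <rss>; its content is not checked.
    static final String FEED_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"rss\"><xs:complexType>"
            + "<xs:sequence><xs:any processContents=\"skip\" "
            + "minOccurs=\"0\" maxOccurs=\"unbounded\"/></xs:sequence>"
            + "<xs:anyAttribute processContents=\"skip\"/>"
            + "</xs:complexType></xs:element>"
            + "</xs:schema>";

    /** Returns true if body validates against FEED_XSD, false otherwise. */
    static boolean isValidFeed(String body) {
        try {
            SchemaFactory factory =
                    SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(FEED_XSD)));
            Validator validator = schema.newValidator();
            // Assumption: refuse external DTDs so a DOCTYPE cannot trigger a download.
            validator.setProperty(XMLConstants.ACCESS_EXTERNAL_DTD, "");
            // A String becomes a Source by wrapping it in a StringReader.
            validator.validate(new StreamSource(new StringReader(body)));
            return true;
        } catch (SAXException | IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidFeed(
                "<rss version=\"2.0\"><channel><title>t</title></channel></rss>"));
        System.out.println(isValidFeed("<html><body><p>hi</p></body></html>"));
    }
}
```

Because no `ErrorHandler` is set, the validator throws a `SAXException` on the first validation error, so an `html` root is rejected as soon as it is seen.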

Or it might help to use:

SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setValidating(true);

SAX parser gets stuck when parsing HTML as a string
