Question

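I have a string that holds data downloaded from a given URL, which may be either XML or HTML. Before parsing it with SAXParser, I want to check whether the downloaded string is an RSS feed or an HTML document. How can I find this out?

For example:

If I download data from http://rss.cnn.com/rss/edition.rss, the resulting string is an RSS feed.

If I download data from http://edition.cnn.com/2014/06/19/opinion/iraq-neocons-wearing/index.html, the resulting string is an HTML document.

I want to continue processing only if the string is an RSS feed.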

Answer 1

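RSS is an XML format, and HTML is XML-like markup (strictly, only XHTML is guaranteed to be well-formed XML). So you can read the data as XML and validate it against the RSS XSD; anything that is not valid RSS, including HTML, fails validation. Like this: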

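import java.net.URL;
import javax.xml.XMLConstants;
import javax.xml.transform.Source;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

// Load the RSS 2.0 schema, then validate the downloaded document against it.
URL schemaFile = new URL("http://europa.eu/rapid/conf/RSS20.xsd");
Source xmlFile = new StreamSource(YOUR_URL_HERE); // URL of the document, as a String
SchemaFactory schemaFactory = SchemaFactory
    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
Schema schema = schemaFactory.newSchema(schemaFile);
Validator validator = schema.newValidator();
try {
  validator.validate(xmlFile); // may also throw IOException; handle or declare it
  // at this line you can be sure it's an RSS 2.0 stream
} catch (SAXException e) {
  // NOT RSS: not well-formed XML, or it does not match the schema
}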

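If you want to check the String itself, you can test whether it has the typical RSS structure, such as the root element and the required elements. But I don't recommend it.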
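For illustration, here is a minimal sketch of that structural check. It assumes RSS 2.0 only, where the root element is `rss`. The class name `RssSniffer` and the method `looksLikeRss` are hypothetical names made up for this example. It looks only at the root element and then stops parsing, so it does not verify required elements such as `channel`, and it would reject Atom (`feed`) and RSS 1.0 (`rdf:RDF`) feeds.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: a sketch, not a full RSS validator.
public class RssSniffer {

    /** Returns true if the string parses as XML far enough to see a root element named "rss". */
    public static boolean looksLikeRss(String content) {
        if (content == null) {
            return false;
        }
        final String[] root = new String[1]; // filled in by the handler below
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            SAXParser parser = factory.newSAXParser();
            parser.parse(new InputSource(new StringReader(content)), new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes atts) throws SAXException {
                    // The first element seen is the root: record it and abort the parse.
                    root[0] = localName;
                    throw new SAXException("root element found, stopping");
                }

                @Override
                public InputSource resolveEntity(String publicId, String systemId) {
                    // Never fetch external DTDs (e.g. an XHTML doctype) over the network.
                    return new InputSource(new StringReader(""));
                }
            });
        } catch (SAXException e) {
            // Either our deliberate abort or a genuine parse error; root[0] tells which.
        } catch (IOException | ParserConfigurationException e) {
            return false;
        }
        return "rss".equals(root[0]);
    }
}
```

A caller could then guard the real parsing step with something like `if (RssSniffer.looksLikeRss(downloaded)) { ... }`. Because the check stops at the first element, it is cheap, but a document that merely starts with `<rss>` passes it; schema validation remains the stricter test.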

How to find whether a given string is an RSS feed
