This is an error that recently started plaguing my RSS feed parser. This morning, four of my RSS feeds started throwing this exception:
For security reasons DTD is prohibited in this XML document. To enable DTD processing set the DtdProcessing property on XmlReaderSettings to Parse and pass the settings into XmlReader.Create method.
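The code used to work fine, but I believe the four specific RSS feeds causing the problem have changed: either they now use a DTD where they did not before, or there was some kind of schema change that my SyndicationFeed cannot parse.
So I changed the code to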
string url = RssFeed.AbsoluteUri;
XmlReaderSettings st = new XmlReaderSettings();
st.DtdProcessing = DtdProcessing.Parse;
st.ValidationType = ValidationType.DTD;
XmlReader reader = XmlReader.Create(url,st);
SyndicationFeed feed = SyndicationFeed.Load(reader);
reader.Close();
Then I started getting this error:
The 'html' element is not declared.
in
System.Xml.XmlValidatingReaderImpl.ValidationEventHandling.System.Xml.IValidationEventHandling.SendEvent(Exception exception, XmlSeverityType severity) at System.Xml.Schema.BaseValidator.SendValidationEvent(String code, String arg) at System.Xml.Schema.DtdValidator.ProcessElement() at System.Xml.Schema.DtdValidator.ValidateElement() at System.Xml.Schema.DtdValidator.Validate() at System.Xml.XmlValidatingReaderImpl.ProcessCoreReaderEvent() at System.Xml.XmlValidatingReaderImpl.Read() at System.Xml.XmlReader.MoveToContent() at System.Xml.XmlReader.IsStartElement(String localname, String ns) at System.ServiceModel.Syndication.Atom10FeedFormatter.CanRead(XmlReader reader) at System.ServiceModel.Syndication.SyndicationFeed.Load[TSyndicationFeed](XmlReader reader) at System.ServiceModel.Syndication.SyndicationFeed.Load(XmlReader reader)
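I have no idea where this 'html' element comes from, because neither the feed (http://jobs.huskyenergy.com/RSS) nor any visible DTD definition mentions it. I also tried setting DtdProcessing to DtdProcessing.Ignore, but that results in the following error: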
The element with name 'html' and namespace '' is not an allowed feed format.
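This is even more confusing, because the namespace is blank and I still cannot tell where this godforsaken html element is coming from.
I am very close to writing my own XML reader and scrapping SyndicationFeed, but I want to make sure I have exhausted every possible solution before going down that road.
One of the RSS feeds, in case it helps anyone: http://jobs.huskyenergy.com/RSS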
Answer 0 (score: 3)
Here is a solution that delivers a new, populated SyndicationFeed object for a given RSS URL:
// namespaces used by this snippet
using System;
using System.IO;
using System.Linq;
using System.Net;
using System.ServiceModel.Syndication;
using System.Xml.Linq;

var feedUrl = @"http://jobs.huskyenergy.com/RSS";
try
{
    using (var webClient = new WebClient())
    {
        // pose as a browser ;-)
        webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
        // fetch the feed as a string
        string rssFeedAsString;
        using (var content = webClient.OpenRead(feedUrl))
        using (var contentReader = new StreamReader(content))
        {
            rssFeedAsString = contentReader.ReadToEnd();
        }
        // parse the string with LINQ to XML and hand SyndicationFeed an XmlReader over the result
        var feed = SyndicationFeed.Load(XDocument.Parse(rssFeedAsString).CreateReader());
        // take the info from the first feed entry (the feed may be empty)
        var firstFeedItem = feed.Items.FirstOrDefault();
        if (firstFeedItem != null)
        {
            Console.WriteLine(firstFeedItem.Title.Text);
            Console.WriteLine(firstFeedItem.Links.FirstOrDefault().Uri.AbsoluteUri);
        }
    }
}
catch (Exception exception)
{
    Console.WriteLine(exception.Message);
}
The site apparently answers only calls that come from a "browser", so the code has to disguise its request as one. The result is:
Summer Student UEO Regulatory & Environment Strategy - (Calgary, AB)
http://jobs.huskyenergy.com/ca/alberta/student/jobid4444904-summer-student-ueo-regulatory--environment-strategy-jobs
The WebClient class also supports asynchronous operation through events and tasks, so making the reader non-blocking is no problem.
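A task-based variant might look like the sketch below. It assumes .NET 4.5 or later (for `WebClient.DownloadStringTaskAsync`) and a reference to System.ServiceModel.Syndication. The names `AsyncFeedLoader`, `LoadAsync` and `Parse` are made up for this illustration and are not part of any library.

```csharp
using System.Net;
using System.ServiceModel.Syndication;
using System.Threading.Tasks;
using System.Xml.Linq;

// Hypothetical helper, named for this example only.
public static class AsyncFeedLoader
{
    // Downloads the feed without blocking the caller, then parses it.
    public static async Task<SyndicationFeed> LoadAsync(string feedUrl)
    {
        using (var webClient = new WebClient())
        {
            // Same browser-like user agent as in the synchronous version.
            webClient.Headers.Add("user-agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.2; .NET CLR 1.0.3705;)");
            string xml = await webClient.DownloadStringTaskAsync(feedUrl);
            return Parse(xml);
        }
    }

    // Parses already-downloaded feed text; kept separate so it can be tried offline.
    public static SyndicationFeed Parse(string xml)
    {
        using (var reader = XDocument.Parse(xml).CreateReader())
        {
            return SyndicationFeed.Load(reader);
        }
    }
}
```

A caller would then write `var feed = await AsyncFeedLoader.LoadAsync(feedUrl);` inside an async method and keep the same try/catch around it.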
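The explanation for the html problem is this: the site changed something and/or it somehow no longer allows automated feed requests. The html element comes from a service interruption page. I tried to access the service (using LINQ to XML in LINQPad, so don't be surprised by the Dump method):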
var feedUrl = @"http://jobs.huskyenergy.com/RSS";
var feedContent = XDocument.Load(feedUrl);
feedContent.Dump();
//var feed = SyndicationFeed.Load(feedContent.CreateReader());
//feed.Dump();
and got this response:
<!DOCTYPE html []>
<!--[if IE 7]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9 lt-ie8"><![endif]-->
<!--[if IE 8]><html lang="en" prefix="og: http://ogp.me/ns#" class="non-js lt-ie9"><![endif]-->
<!--[if gt IE 8]><!-->
<html lang="en" prefix="og: http://ogp.me/ns#" class="non-js">
<!--<![endif]-->
<head>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width" />
<title>
Service Interruption
</title>
<link rel="stylesheet" href="http://seostatic.tmp.com/SiteOutage/style.css" />
</head>
<body>
<p id="outageMessage">This system is currently experiencing a service interruption. <br />We apologize for any inconvenience.</p>
</body>
</html>
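So that is where the html element comes from. :-) Opening the site in a browser looks fine, which means that XmlReader and LINQ to XML are both working correctly; the server simply returns an HTML page instead of the feed.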
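To catch this situation before SyndicationFeed.Load throws its rather confusing message, one option is to look at the root element of the response first. The following is only a sketch: `FeedSniffer` and `LooksLikeFeed` are hypothetical names, and the check covers just the two formats SyndicationFeed.Load accepts (RSS 2.0 and Atom 1.0).

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper, named for this example only.
public static class FeedSniffer
{
    // Returns true when the root element is <rss> (RSS 2.0, no namespace)
    // or <feed> in the Atom 1.0 namespace; false for anything else,
    // including an HTML error page or text that is not well-formed XML.
    public static bool LooksLikeFeed(string xml)
    {
        XDocument doc;
        try
        {
            doc = XDocument.Parse(xml);
        }
        catch (XmlException)
        {
            return false; // not well-formed XML at all
        }

        XName name = doc.Root.Name;
        bool isRss = name.LocalName == "rss" && name.NamespaceName == "";
        bool isAtom = name.LocalName == "feed" && name.NamespaceName == "http://www.w3.org/2005/Atom";
        return isRss || isAtom;
    }
}
```

If `LooksLikeFeed` returns false, logging the first few hundred characters of the response usually shows right away that the server sent an outage or error page rather than a feed.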