I use the following code to convert an HTTP response stream into an XmlDocument.
HttpWebRequest request = WebRequest.Create(url) as HttpWebRequest;
HttpWebResponse response = request.GetResponse() as HttpWebResponse;
Stream responseStream = response.GetResponseStream();
StreamReader responseReader = new StreamReader(responseStream);
String responseString = responseReader.ReadToEnd();
Console.WriteLine(responseString);
Int32 htmlTagIndex = responseString.IndexOf("<html",
StringComparison.OrdinalIgnoreCase);
XmlDocument responseXhtml = new XmlDocument();
responseString = responseString.Substring(htmlTagIndex); // MARK 1
responseString = responseString.Replace("&nbsp;", " "); // MARK 2
responseXhtml.LoadXml(responseString);
return responseXhtml;
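The MARK 1 line skips the DOCTYPE declaration.
The MARK 2 line avoids the "reference to undeclared entity" error.
Is there a better way? The code above does too much string manipulation.
Thanks!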
Answer 0 (score: 5)
I would just use HtmlAgilityPack to parse the HTML directly. It also works if you do have to convert the HTML to XML.
using (WebClient wc = new WebClient())
{
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(wc.DownloadString("http://www.google.com"));
    doc.OptionOutputAsXml = true;
    StringWriter writer = new StringWriter();
    doc.Save(writer);
    var xDoc = XDocument.Load(new StringReader(writer.ToString()));
}
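Since the question asks for an XmlDocument rather than an XDocument, the same idea can be sketched as below. This is a minimal sketch and not a tested drop-in: it assumes the HtmlAgilityPack package is referenced, the class name HtmlToXml and the method ToXmlDocument are made up for illustration, and I have not checked how every HTML entity comes out in the XML that HtmlAgilityPack emits.

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

public static class HtmlToXml
{
    // Hypothetical helper (not part of HtmlAgilityPack): parses an HTML string
    // and reloads the XML that HtmlAgilityPack writes into an XmlDocument.
    public static XmlDocument ToXmlDocument(string html)
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.OptionOutputAsXml = true;   // ask Save() to write XML instead of HTML
        doc.LoadHtml(html);

        using (StringWriter writer = new StringWriter())
        {
            doc.Save(writer);
            XmlDocument xml = new XmlDocument();
            xml.LoadXml(writer.ToString());
            return xml;
        }
    }

    public static void Main()
    {
        using (WebClient wc = new WebClient())
        {
            // Same URL as in the answer above.
            XmlDocument responseXhtml = ToXmlDocument(wc.DownloadString("http://www.google.com"));
            Console.WriteLine(responseXhtml.DocumentElement.Name);
        }
    }
}
```

Compared with the code in the question, there is no Substring or Replace call: the DOCTYPE and entity handling is left to the parser, and only the final LoadXml step still goes through a string.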