Question

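I need to parse a web page for two meta tag values. I'm not sure what the most efficient way is to parse a page's HTML for meta tag data.

Can I convert the page's HTML string to XML and then select the meta tags?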

WebClient wc = new WebClient();
wc.Headers.Set("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.19 ( .NET CLR 3.5.30729; .NET4.0E)");
string html  = wc.DownloadString(String.Format("http://www.geobytes.com/IpLocator.htm?GetLocation&template=php3.txt&IpAddress={0}", ip));
XmlDocument xdoc = new XmlDocument();
xdoc.LoadXml(html);   // ERROR HERE: "The 'meta' start tag on line 23 position 2 does not match the end tag of 'head'. Line 26, position 3"
XmlNodeList interNode = xdoc.DocumentElement.SelectNodes("//meta");

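I'm not familiar with all of the C# libraries. Is there a better alternative that makes it easier to get all the meta tags from the returned HTML?

When I try to parse the HTML, I get this error:

The 'meta' start tag on line 23 position 2 does not match the end tag of 'head'. Line 26, position 3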

Answer 1

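I'd suggest the HTML Agility Pack. It handles malformed HTML well while giving you the power of XPath to isolate nodes and values.

Your selection would look something like this (using .NET 4.0):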

var nodes = doc.DocumentNode.SelectNodes("//meta");
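The one-liner above assumes a `doc` that has already been loaded. As a rough sketch of the whole round trip, assuming the HtmlAgilityPack NuGet package is referenced and that `html` holds the downloaded page, something like the following could collect the tags into a dictionary. The `MetaTagReader` class and its `Read` method are hypothetical names used only for illustration.

```csharp
// Requires the third-party HtmlAgilityPack NuGet package (not part of .NET itself).
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class MetaTagReader // hypothetical helper, not part of the library
{
    // Returns name -> content for every <meta> tag that has a name attribute.
    public static Dictionary<string, string> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerates unclosed tags such as <meta ...>

        var result = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//meta[@name]");
        if (nodes == null) return result;

        foreach (HtmlNode node in nodes)
        {
            string name = node.GetAttributeValue("name", "");
            string content = node.GetAttributeValue("content", "");
            result[name] = content; // a later duplicate overwrites an earlier one
        }
        return result;
    }
}
```

Calling `MetaTagReader.Read(html)` on the string returned by `wc.DownloadString(...)` in the question would then give a dictionary that can be indexed by meta name. The null check matters because a page with no matching tags would otherwise throw inside the `foreach`.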

Answer 2

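You could use an HTML parser instead of an XML parser, you could manipulate the string before parsing it as XML, or you could use regular expressions, which suit this kind of situation. So, assuming System.Text.RegularExpressions is imported: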

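// Verbatim strings (@"...") escape a double quote by doubling it, not with a backslash.
// This pattern only matches tags written exactly as <meta name="..." content="...">.
Regex metaTag = new Regex(@"<meta name=""(.+?)"" content=""(.+?)"">");
Dictionary<string, string> metaInformation = new Dictionary<string, string>();

foreach (Match m in metaTag.Matches(html)) {
    // The indexer overwrites duplicates; Add would throw on a repeated name.
    metaInformation[m.Groups[1].Value] = m.Groups[2].Value;
}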

Now you can access any piece of metadata with metaInformation["meta name"].

Easiest way to extract meta tags from a downloaded HTML file
