I have an ASP.NET VB project that needs to parse some raw XML. The XML comes from a database and is laid out as follows:
<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006– </A>; Postdoctoral Fellow, Toronto Western Hosp. 2000–06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. & Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>
This is the code I am using:
Dim FullBio As New System.Xml.XmlDocument
Dim NodeList As System.Xml.XmlNodeList
Dim Node As System.Xml.XmlNode
FullBio.LoadXml(bio.Item(11))
NodeList = FullBio.SelectNodes("a")
For Each Node In NodeList
    Dim name = Node.Attributes.GetNamedItem("name").Value()
    lblEducation.Text = lblEducation.Text + name.ToString() + Node.InnerText + "<br />"
Next
So the XML is loaded into an XmlDocument.
The string passed to FullBio.LoadXml(bio.Item(11)) is the XML I provided at the top. I get this error message:
'SN' is an unexpected token. The expected token is '"' or '''. Line 1, position 49.
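I know the error occurs because the attribute values are not quoted. Is there a way to make XmlDocument accept unquoted attributes, or a simple way to add quotes to the attributes with a regular expression before loading the string into the XmlDocument?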
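Answer 0 (score: 2)
What you have is not valid XML, and XmlDocument expects well-formed XML as input. I suggest using an HTML parser such as Html Agility Pack to parse the HTML (which is what your input actually is). For example, to list the name attribute value of every anchor, it is as simple as this: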
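using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        // Load() reads from a file; for markup held in a string
        // (e.g. a database column), use document.LoadHtml(html) instead.
        document.Load("test.html");

        // Html Agility Pack exposes element names in lower case, so "a" also matches <A>.
        foreach (var a in document.DocumentNode.Descendants("a"))
        {
            // GetAttributeValue returns the fallback instead of throwing
            // when an anchor has no name attribute.
            Console.WriteLine("Name: {0}", a.GetAttributeValue("name", string.Empty));
        }
    }
}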
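Answer 1 (score: 0)
I would write some logic to insert quotes around the attribute values. If the XML is not well-formed, the document will fail to load.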
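As a rough illustration of that idea, here is a minimal sketch that quotes unquoted attribute values with a regular expression before calling LoadXml. The class and method names (QuoteAttributesSketch, MakeWellFormed) are hypothetical and not part of any library. It also escapes bare ampersands, because the sample contains "Bd. of Dirs. & Fundraising Chair", which would trip the XML parser even after the attributes are quoted. It assumes no quoted attribute value contains '>' and no unquoted value ends in '/'; real-world HTML can easily break those assumptions, which is why a proper HTML parser is usually the safer choice.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

public static class QuoteAttributesSketch
{
    // Alternative 1 skips values that are already quoted;
    // alternative 2 captures " name" and an unquoted value.
    const string AttrPattern = @"(""[^""]*""|'[^']*')|(\s[\w:-]+)=([^\s""'<>=]+)";

    // A bare '&' is one that does not start a predefined XML entity
    // or a numeric character reference.
    const string BareAmpPattern = @"&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9A-Fa-f]+);)";

    public static string MakeWellFormed(string markup)
    {
        // Rewrite attributes only inside tags, never in text content.
        string quoted = Regex.Replace(markup, @"<[^<>]+>", tag =>
            Regex.Replace(tag.Value, AttrPattern, m =>
                m.Groups[1].Success
                    ? m.Value
                    : m.Groups[2].Value + "=\"" + m.Groups[3].Value + "\""));
        return Regex.Replace(quoted, BareAmpPattern, "&amp;");
    }

    public static void Main()
    {
        // Shortened version of the markup from the question.
        string html = "<BODY><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>; "
                    + "Bd. of Dirs. & Fundraising Chair</BODY>";

        var doc = new XmlDocument();
        doc.LoadXml(MakeWellFormed(html));

        // XPath is case-sensitive: the elements are <A>, not <a>.
        foreach (XmlNode a in doc.SelectNodes("//A"))
        {
            Console.WriteLine("{0}: {1}", a.Attributes["name"].Value, a.InnerText);
        }
    }
}
```

Run against the shortened sample, this prints "SN: AARTS" and "GN: Michelle Marie". Note that HTML-only entities such as &nbsp; are not defined in plain XML, so this sketch escapes their ampersand and they end up as literal text.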
Alternatively, you can use the Html2Xhtml library. Here is a link:
http://corsis.sourceforge.net/index.php/Html2Xhtml
You should be able to use that library to get the content into an XDocument, like this:
string html = "<html><head><TITLE>title</TITLE></head><body>I♥NY<p>b<br>c:±<img src=2 nonsense=x></a><font size=2>c</font></body></html>";
var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument(keepXhtmlNamespace: true);
Console.WriteLine(xdoc);
I believe Html2Xhtml supports .NET Framework 2.0 and above. If it does not, I am fairly sure one of the earlier versions will; failing that, you can use this:
http://www.codeproject.com/KB/XML/HTML2XHTML.aspx
That article uses HTML Tidy, and its source code should work on 2.0.
Answer 2 (score: 0)
You could also try SgmlReader, which is well suited to this kind of problem.
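// Requires: using System.IO; using System.Xml; using Sgml;
// html is the string holding the markup to parse.
using (var strReader = new StringReader(html))
{
    using (SgmlReader sgmlReader = new SgmlReader())
    {
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        // Fold element and attribute names to lower case, so <A> becomes <a>.
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = strReader;
        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.Load(sgmlReader);
        // doc now holds well-formed XML and can be queried, e.g. doc.SelectNodes("//a").
    }
}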