How can I make XmlDocument handle XML with unquoted attributes?

Asked: 2011-09-08 20:09:35

Tags: asp.net regex vb.net xmldocument

I have an ASP.NET VB project that needs to parse some raw XML. The XML comes from a database and is laid out like this:

<HTML><HEAD><TITLE></TITLE></HEAD><BODY><STRONG><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A>, </STRONG><A name=HO>B.Sc.</A>, <A name=HO>M.Sc.</A>, <A name=HO>Ph.D.</A>; <A name=OC>scientist, professor</A>; b. <A name=BC>St. Marys</A>, Ont. <A name=BY>1970</A>; <A name=PA>d. Wm. and H. Aarts</A>; <A name=ED>e. Univ. of Western Ont. B.Sc.(Hons.) 1994, M.Sc. 1997</A>; <A name=ED>McGill Univ. Ph.D. 2002</A>; <A name=MA>m. L. MacManus</A>; two children; <A name=PO>CANADA RESEARCH CHAIR IN SIGNAL TRANSDUCTION IN ISCHEMIA</A> and <A name=PO>ASST. PROF., DEPT. OF BIOL. SCI., UNIV. OF TORONTO SCARBOROUGH 2006&ndash;&nbsp;&nbsp;</A>; Postdoctoral Fellow, Toronto Western Hosp. 2000&ndash;06; Expert Cons., Auris Med. SAS, Montpellier, France; mem., Centre for the Neurobiol. of Stress; named INMHA Brainstar of the Year 2003; Bd. of Dirs. &amp; Fundraising Chair, N'Sheemaehn Childcare; mem., Soc. for Neurosci.; Cdn. Physiol. Soc.; Cdn. Assn. for Neurosci.; <A name=WK>co-author: 'Therapeutic Tools in Brain Damage' in <EM>Proteomics and Protein Interactions: Biology, Chemistry, Bioinformatics and Drug Design </EM>2005; 18 pub. journal articles</A>; Office: <A name=OF1_L1>1265 Military Trail</A>, <A name=OF1_CT>Scarborough</A>, <A name=OF1_PR>Ont.</A> <A name=OF1_PC>M1C 1A4</A>. </BODY></HTML>

This is the code I am using:

        Dim FullBio As New System.Xml.XmlDocument
        Dim NodeList As System.Xml.XmlNodeList
        Dim Node As System.Xml.XmlNode

        FullBio.LoadXml(bio.Item(11))
        NodeList = FullBio.SelectNodes("a")

        For Each Node In NodeList
            Dim name = Node.Attributes.GetNamedItem("name").Value()
            lblEducation.Text = lblEducation.Text + name.ToString() + Node.InnerText + "<br />"
        Next

So the XML loaded into the XmlDocument by

FullBio.LoadXml(bio.Item(11))
is the XML I provided at the top. I get this error message:

'SN' is an unexpected token. The expected token is '"' or '''. Line 1, position 49.

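I know the error occurs because the attribute values are not quoted. Is there a way to make XmlDocument accept unquoted attributes, or a simple way to use a regular expression to add quotes to the attributes before loading the string into the XmlDocument?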

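3 Answers:

Answer 0 (score: 2)

What you have is not valid XML, and XmlDocument expects valid XML as input. I recommend using an HTML parser such as Html Agility Pack to parse the HTML (which is what your input really is). For example, listing the name attribute value of every anchor is as simple as this: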

using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var document = new HtmlDocument();
        document.Load("test.html");
        foreach (var a in document.DocumentNode.Descendants("a"))
        {
            Console.WriteLine("Name: {0}", a.Attributes["name"].Value);
        }
    }
}
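The example above reads from a file, while the question has the markup in a string. A minimal variation for that case, assuming the Html Agility Pack package is referenced, might look like the sketch below. The `html` literal is only a shortened stand-in for the database value (`bio.Item(11)` in the question).

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Shortened stand-in for the string read from the database.
        string html = "<HTML><BODY><A name=SN>AARTS</A>, <A name=GN>Michelle Marie</A></BODY></HTML>";

        var document = new HtmlDocument();
        // LoadHtml parses markup held in a string; unquoted attributes are accepted.
        document.LoadHtml(html);

        // Html Agility Pack exposes element names in lower case, so "a" also matches <A>.
        foreach (var a in document.DocumentNode.Descendants("a"))
        {
            // GetAttributeValue returns the fallback ("") when the attribute is missing,
            // which avoids a null reference on anchors without a name.
            string name = a.GetAttributeValue("name", "");
            Console.WriteLine("{0}: {1}", name, a.InnerText);
        }
    }
}
```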

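Answer 1 (score: 0)

I would write some logic to insert quotes around the attribute values. If the XML is not well-formed, the document will fail to load.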
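As a rough illustration of that quoting logic, here is a minimal sketch using a regular expression. The class and method names (`AttributeQuoter`, `QuoteAttributes`) are made up for this example, and the `raw` string is a shortened stand-in for the markup in the question. It assumes attribute values contain no whitespace and that existing quoted values do not themselves contain text that looks like `name=value`. Note that quoting alone is not enough for the sample input: `&ndash;` and `&nbsp;` are HTML entities that XML does not define, so the sketch swaps them for numeric character references. XPath is also case-sensitive, so the anchors are selected with `//A`.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

static class AttributeQuoter
{
    // Finds opening tags such as <A name=SN>.
    static readonly Regex Tag = new Regex(@"<[A-Za-z][^<>]*>");

    // Inside one tag: whitespace, an attribute name, '=', then an unquoted value.
    // Values that already start with a quote are not matched.
    static readonly Regex BareAttribute = new Regex(@"(\s[\w:.-]+)=([^\s""'<>=]+)");

    public static string QuoteAttributes(string markup)
    {
        // Only text inside tags is rewritten, so "a=b" in ordinary text is left alone.
        return Tag.Replace(markup, tag => BareAttribute.Replace(tag.Value, "$1=\"$2\""));
    }

    static void Main()
    {
        // Shortened stand-in for the markup stored in the database.
        string raw = "<BODY><A name=SN>AARTS</A>, <A name=BY>1970</A> 2006&ndash;&nbsp;</BODY>";

        // Quote the attributes, then replace the HTML-only entities
        // with numeric references that any XML parser accepts.
        string fixedUp = QuoteAttributes(raw)
            .Replace("&ndash;", "&#8211;")
            .Replace("&nbsp;", "&#160;");

        var doc = new XmlDocument();
        doc.LoadXml(fixedUp);

        // Element names are upper case in the input, so the XPath must be too.
        foreach (XmlNode node in doc.SelectNodes("//A"))
        {
            Console.WriteLine("{0}: {1}", node.Attributes["name"].Value, node.InnerText);
        }
    }
}
```

This only works when the markup is otherwise well-formed (every tag closed, a single root element). Unclosed tags such as `<br>`, a value directly followed by `/>`, or other named entities would still break `LoadXml`, which is where a real HTML parser or converter is the safer choice.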

Alternatively, you can use the Html2Xhtml library. Here is a link:

http://corsis.sourceforge.net/index.php/Html2Xhtml

You should be able to use the library to get the content into an XDocument, like this:

string html = "<html><head><TITLE>title</TITLE></head><body>I♥NY<p>b<br>c:±<img src=2 nonsense=x></a><font size=2>c</font></body></html>";

var xdoc = Html2Xhtml.RunAsFilter(stdin => stdin.Write(html)).ReadToXDocument(keepXhtmlNamespace: true);

Console.WriteLine(xdoc);

I believe Html2Xhtml supports .NET Framework 2.0 and above. If it does not, I am fairly sure one of the earlier versions will; failing that, you can use this:

http://www.codeproject.com/KB/XML/HTML2XHTML.aspx

That article uses HTML Tidy, and its source code should work on .NET 2.0.

Answer 2 (score: 0)

You could also try SgmlReader, which is well suited to this kind of problem.

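using System.IO;
using System.Xml;
using Sgml;

// html holds the raw markup string (for example, the value read from the database)
using (var strReader = new StringReader(html))
{
    using (SgmlReader sgmlReader = new SgmlReader())
    {
        // treat the input as HTML and fold element names to lower case
        sgmlReader.DocType = "HTML";
        sgmlReader.WhitespaceHandling = WhitespaceHandling.All;
        sgmlReader.CaseFolding = Sgml.CaseFolding.ToLower;
        sgmlReader.InputStream = strReader;

        // create document
        XmlDocument doc = new XmlDocument();
        doc.PreserveWhitespace = true;
        doc.Load(sgmlReader);
    }
}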