How do I parse this HTML to get the content I want?

Time: 2012-06-28 18:23:14

Tags: c# html parsing

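I'm currently trying to parse an HTML document to retrieve all of the footnotes in it; the document contains dozens and dozens of them. I can't figure out the expression to use to extract all of the content I want. The problem is that the classes (e.g. "calibre34") are randomized in every document. The only way to tell where a footnote is located is to search for "[hide]": the footnote text always comes right after it and ends with a </td> tag. Below is an example of one footnote from the HTML document; all I want is the text. Any ideas? Thanks, guys!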

<td class="calibre33">1.<span><a class="x-xref" href="javascript:void(0);">
[hide]</a></span></td>
<td class="calibre34">
Among the other factors on which the premium would be based are the
average size of the losses experienced, a margin for contingencies,
a loading to cover the insurer's expenses, a margin for profit or
addition to the insurer's surplus, and perhaps the investment
earnings the insurer could realize from the time the premiums are
collected until the losses must be paid.</td>

2 Answers:

Answer 0: (score: 3)

Use HtmlAgilityPack to load the HTML document, then extract the footnotes with this XPath:

    //td[text()='[hide]']/following-sibling::td

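Basically, it first selects all td nodes that contain [hide], and then moves to their following sibling, that is, the next td. Once you have this collection of nodes, you can extract their inner text (in C#, using the support that HtmlAgilityPack provides).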
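As a rough illustration of that approach, here is a minimal sketch that assumes the HtmlAgilityPack NuGet package is referenced. The names `FootnoteExtractor`, `ExtractFootnotes`, and `FootnoteDemo` are hypothetical and not part of the library. Note that in the sample markup from the question, `[hide]` sits inside a nested `<a>` and is preceded by a line break, so this sketch uses a slightly different XPath that tests the trimmed text of a descendant `<a>` and takes only the first following `td`; treat that variant as an assumption rather than the answer's exact expression.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

public static class FootnoteExtractor
{
    // Hypothetical helper: returns the text of every footnote cell found in the HTML.
    public static List<string> ExtractFootnotes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Assumed variant of the XPath: find each td that has a descendant <a>
        // whose whitespace-normalized text is "[hide]", then take the first td after it.
        const string xpath =
            "//td[.//a[normalize-space(.)='[hide]']]/following-sibling::td[1]";

        var footnotes = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return footnotes; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode td in nodes)
        {
            // InnerText drops the tags; DeEntitize decodes entities such as &amp;
            footnotes.Add(HtmlEntity.DeEntitize(td.InnerText).Trim());
        }
        return footnotes;
    }
}

public static class FootnoteDemo
{
    public static void Main()
    {
        // Shortened stand-in for the markup in the question (class names are arbitrary).
        string html = @"<table><tr>
<td class=""calibre33"">1.<span><a class=""x-xref"" href=""javascript:void(0);"">
[hide]</a></span></td>
<td class=""calibre34"">
Among the other factors on which the premium would be based ...</td>
</tr></table>";

        foreach (string note in FootnoteExtractor.ExtractFootnotes(html))
        {
            Console.WriteLine(note);
        }
    }
}
```

Because the expression never mentions a class attribute, it should keep working when the class names change from document to document.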

Answer 1: (score: 2)

How about using MSHTML to parse the HTML source? Here is some demo code. Enjoy.

using System;
using System.Collections.Generic;
using System.Net;
using mshtml; // requires a reference to the Microsoft.mshtml interop assembly

public class CHtmlPraseDemo
{
    private string strHtmlSource;
    public IHTMLDocument2 oHtmlDoc;

    public CHtmlPraseDemo(string url)
    {
        // Download the page, then let MSHTML parse it into a DOM.
        GetWebContent(url);
        oHtmlDoc = (IHTMLDocument2)new HTMLDocument();
        oHtmlDoc.write(strHtmlSource);
        oHtmlDoc.close(); // finish writing so the document is fully parsed
    }

    // Returns the inner HTML of every <td> whose class name equals TdClassName.
    public List<string> GetTdNodes(string TdClassName)
    {
        List<string> listOut = new List<string>();
        IHTMLElement2 ie = (IHTMLElement2)oHtmlDoc.body;
        IHTMLElementCollection iec = ie.getElementsByTagName("td");
        foreach (IHTMLElement item in iec)
        {
            if (item.className == TdClassName)
            {
                listOut.Add(item.innerHTML);
            }
        }
        return listOut;
    }

    // Fetches the raw HTML of the page into strHtmlSource.
    void GetWebContent(string strUrl)
    {
        using (WebClient wc = new WebClient())
        {
            strHtmlSource = wc.DownloadString(strUrl);
        }
    }
}

class Program
{
    static void Main(string[] args)
    {
        CHtmlPraseDemo oH = new CHtmlPraseDemo("http://stackoverflow.com/faq");

        Console.Write(oH.oHtmlDoc.title);

        // "x" is a placeholder; pass the class name of the cells you want.
        List<string> l = oH.GetTdNodes("x");
        foreach (string n in l)
        {
            Console.WriteLine("new td");
            Console.WriteLine(n);
        }

        Console.Read();
    }
}