Question

I am trying to extract the text from this HTML tag:

<span id="example1">sometext</span>

I have this code:

using System;
using System.Net;
using HtmlAgilityPack;

namespace GC_data_console
{
    class Program
    {
        public static void Main(string[] args)
        {

            using (var client = new WebClient())
            {
                // Download the HTML
                string html = client.DownloadString("https://www.requestedwebsite.com");


                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);


                foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//span"))
                {
                    HtmlAttribute href = link.Attributes["id='example1'"];

                    if (href != null)
                    {
                        Console.WriteLine(href.Value.ToString());
                        Console.ReadLine();
                    }
                }
            }
        }
    }
}

But I still do not get the text "sometext".

But if I use HtmlAttribute href = link.Attributes["id"]; instead, I get all the id names.

What am I doing wrong?

Answer 1

First you need to understand the difference between an HtmlNode and an HtmlAttribute. Your code as written cannot solve the problem.

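An HtmlNode represents a tag used in HTML, such as span, div, p, a, and many others. An HtmlAttribute represents an attribute of an HtmlNode: for example, the href attribute is used on a, while attributes such as style, class, id, and name are used on almost all HTML tags.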

In the following HTML

<span id="firstName" style="color:#232323">Some Firstname</span>

span is the HtmlNode, while id and style are HtmlAttributes. You can get the value Some Firstname with the HtmlNode.InnerText property.
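To make the node/attribute distinction concrete, here is a minimal sketch that reads both from the first span in a piece of HTML. The class and method names (NodeVersusAttribute, FirstSpanText, FirstSpanAttribute) are made up for illustration; only HtmlDocument, SelectSingleNode, InnerText, and the Attributes indexer come from HtmlAgilityPack.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class NodeVersusAttribute
{
    // Returns the text between the opening and closing tags of the first <span>,
    // or null when the HTML contains no <span>.
    public static string FirstSpanText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        return span?.InnerText;
    }

    // Returns the value of the named attribute on the first <span>,
    // or null when the span or the attribute is missing.
    public static string FirstSpanAttribute(string html, string attributeName)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        return span?.Attributes[attributeName]?.Value;
    }
}
```

With the span shown above, FirstSpanText should give the element's text, while FirstSpanAttribute(html, "id") should give firstName: the attribute collection only ever holds the values inside the tag, never the text between the tags.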

Selecting HtmlNodes from an HtmlDocument is also not that straightforward. You need to provide a proper XPath to select the nodes you want.
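As a rough illustration of how much the XPath matters, the sketch below counts how many nodes a given expression selects. The helper name XPathMatchCount.Count is invented for this example; it relies only on HtmlDocument.LoadHtml and HtmlNode.SelectNodes.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class XPathMatchCount
{
    // Returns how many nodes in the HTML match the XPath expression.
    // SelectNodes returns null by default when nothing matches, so that case maps to 0.
    public static int Count(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes?.Count ?? 0;
    }
}
```

For a made-up page containing one span directly inside body and another nested in a div, "//span" would be expected to match both, "/html/body/span" only the first, and "//span[@id]" only the spans that carry an id attribute.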

Now, if you want to get the text of <span id="ctl00_ContentBody_CacheName">SliverCup Studios East</span> in the HTML served by someurl.com, you need to write the following code.

// Where() is a LINQ extension method: add "using System.Linq;" at the top of the file.
using (var client = new WebClient())
{
    string html = client.DownloadString("https://www.someurl.com");

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    // By default SelectNodes returns null (not an empty collection) when nothing matches.
    var spans = doc.DocumentNode.SelectNodes("//span");
    if (spans != null)
    {
        // Keep only the span nodes whose id attribute is "ctl00_ContentBody_CacheName".
        var nodes = spans
            .Where(d => d.Attributes.Contains("id"))
            .Where(d => d.Attributes["id"].Value == "ctl00_ContentBody_CacheName");

        foreach (HtmlNode node in nodes)
        {
            Console.WriteLine(node.InnerText);
        }
    }
}

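In the code above, the XPath //span matches every span element in the document, at any depth, and the two Where calls then keep only the ones whose id is ctl00_ContentBody_CacheName. To match only tags at a particular place in the hierarchy, you need to use a more specific XPath.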
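As an alternative sketch, the id test can be moved into the XPath itself with an attribute predicate, so no LINQ filtering is needed. SpanById.GetText is a hypothetical helper name; it assumes the id contains no single quote, because the value is embedded in the XPath string literal.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class, for illustration only.
public static class SpanById
{
    // Returns the inner text of the first <span> whose id equals the given value,
    // or null when no such span exists.
    // Assumption: id contains no single quote, since it is placed inside '...' in the XPath.
    public static string GetText(string html, string id)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode($"//span[@id='{id}']");
        return node?.InnerText;
    }
}
```

Applied to the question, SpanById.GetText(html, "example1") would be the call to try: the predicate [@id='example1'] belongs in the XPath, not in the Attributes indexer, which only accepts an attribute name such as "id".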

This should help you resolve your issue.
