Question

I am scraping the page www.thenextweb.com

I want to extract all the post links, article content, article images, and so on.

I wrote this code...

string url = TextBox1.Text.ToString();
var webGet = new HtmlWeb();
var document = webGet.Load(url);

var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            select new
            {
                Contr = info.InnerHtml
            };

lvLinks.DataSource = infos;
lvLinks.DataBind();

This extracts all the required information from the page. I display it on the main page through a ListView control in an ASP.NET page, like this:

<li> <%# Eval("Contr") %> </li>

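What I want now is a way to pull the individual pieces of information out of each node in infos: the link URL, the post image, the text, and so on.

I want to store them as URL[0], PostContent[0], PostImage[0], Date[0], then URL[1], PostContent[1], and so on, with the respective values held in these string arrays, one post after another.

In other words, I want to extract the information from the inner nodes of infos one by one.

Can you suggest a way to do this?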

Answer 1

Why not create a class that parses the HTML and exposes those nodes as properties?

class ArticleInfo
{
    public ArticleInfo (string html) { ... }
    public string URL { get; set; }
    public string PostContent { get; set; }
    public string PostImage { get; set; }
    public DateTime PostDate { get; set; }
}

Then you can do this:

var infos = from info in document.DocumentNode.SelectNodes("//div[@class='article-listing']")
            select new ArticleInfo(info.InnerHtml);

Then, if you have an array `infoArray = infos.ToArray()`, you can do this:

infoArray[0].URL
infoArray[0].PostDate
infoArray[1].PostContent

etc...
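
If you really do need the separate arrays from the question (URL[0], PostContent[0], and so on), they can be projected from the same array with LINQ. This is only a minimal sketch: the trimmed-down ArticleInfo class and the example.com values are placeholders made up for illustration, not scraped data.

```csharp
using System;
using System.Linq;

// Trimmed-down stand-in for the ArticleInfo class, just enough for this demo.
class ArticleInfo
{
    public string URL { get; set; }
    public string PostContent { get; set; }
}

static class ParallelArraysDemo
{
    static void Main()
    {
        // Placeholder values for illustration only.
        var infoArray = new[]
        {
            new ArticleInfo { URL = "http://example.com/a", PostContent = "first" },
            new ArticleInfo { URL = "http://example.com/b", PostContent = "second" }
        };

        // One array per property; index i in each array refers to the same post.
        string[] urls = infoArray.Select(a => a.URL).ToArray();
        string[] contents = infoArray.Select(a => a.PostContent).ToArray();

        Console.WriteLine(urls[1]);      // prints "http://example.com/b"
        Console.WriteLine(contents[0]);  // prints "first"
    }
}
```

The arrays stay aligned because both projections walk infoArray in the same order.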

Update

Something like this:

class ArticleInfo
{
    private string html;

    public ArticleInfo (string html) 
    {
        this.html = html;
        URL = //code to extract and assign Url from html
        PostContent = //code to extract content from html
        PostImage = //code to extract Image from html
        PostDate = //code to extract date from html
    }

    public string URL { get; private set; }
    public string PostContent { get; private set; }
    public string PostImage { get; private set; }
    public DateTime PostDate { get; private set; }

    public string Contr { get { return html; } }
}

Or this:

class ArticleInfo
{
    private string html;

    public ArticleInfo (string html) 
    {
        this.html = html;
    }

    public string URL { get { return /*code to extract and return Url from html*/; } }
    public string PostContent { get { return /*code to extract and return Content from html*/; } }
    public string PostImage { get { return /*code to extract and return Image from html*/; } }
    public DateTime PostDate { get { return /*code to extract and return Date from html*/; } }

    public string Contr { get { return html; } }
}

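Then your LINQ query returns a sequence of ArticleInfo objects instead of an anonymous type. This way you don't have to maintain a separate array for each element of a post. Each item in the array (or sequence) has properties that give you the associated elements for that item. Of course, this may not suit what you are trying to achieve. I just thought it might be a bit cleaner.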
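
To make the placeholder comments concrete, here is one possible way to fill in the first variant with HtmlAgilityPack. Treat it as a sketch under assumptions: I have not checked the site's markup, so the XPath expressions (first link, first image, first paragraph, an element with class 'date') are guesses you would need to adjust. PostDate is made nullable here, which is my change, so that a missing or unparsable date does not throw. The http:// scheme on the URL is also added by me.

```csharp
using System;
using System.Globalization;
using System.Linq;
using HtmlAgilityPack;

class ArticleInfo
{
    public ArticleInfo(string html)
    {
        Contr = html;

        // Parse the fragment on its own so lookups cannot match other listings.
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var root = doc.DocumentNode;

        // ASSUMPTION: these XPath expressions are guesses at the page structure.
        var link = root.SelectSingleNode("//a[@href]");
        var image = root.SelectSingleNode("//img[@src]");
        var body = root.SelectSingleNode("//p");
        var date = root.SelectSingleNode("//*[@class='date']");

        // SelectSingleNode returns null when nothing matches, so guard each one.
        URL = link != null ? link.GetAttributeValue("href", "") : "";
        PostImage = image != null ? image.GetAttributeValue("src", "") : "";
        PostContent = body != null ? HtmlEntity.DeEntitize(body.InnerText).Trim() : "";

        DateTime parsed;
        if (date != null && DateTime.TryParse(date.InnerText.Trim(),
                CultureInfo.InvariantCulture, DateTimeStyles.None, out parsed))
        {
            PostDate = parsed;
        }
    }

    public string URL { get; private set; }
    public string PostContent { get; private set; }
    public string PostImage { get; private set; }
    public DateTime? PostDate { get; private set; }  // null if no date was found
    public string Contr { get; private set; }
}

static class ScrapeDemo
{
    static void Main()
    {
        var document = new HtmlWeb().Load("http://www.thenextweb.com");

        // SelectNodes returns null (not an empty list) when nothing matches.
        var nodes = document.DocumentNode.SelectNodes("//div[@class='article-listing']");
        var infoArray = nodes == null
            ? new ArticleInfo[0]
            : nodes.Select(n => new ArticleInfo(n.InnerHtml)).ToArray();

        foreach (var info in infoArray)
            Console.WriteLine(info.URL + " | " + info.PostDate);
    }
}
```

Since Contr is still exposed, the existing ListView binding with Eval("Contr") should keep working if you set lvLinks.DataSource to infoArray.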

Extracting specific node values from a list of nodes using HtmlAgilityPack in C#
