HTML Agility Pack获取<p itemprop =“”> </p>的内容

时间:2013-12-08 23:01:06

标签: c# parsing xpath html-agility-pack

我正在尝试获取使用HTML敏捷包的内容。以下是我正在尝试解析的HTML示例:

         <p itemprop="articleBody">
    Hundreds of thousands of Ukrainians filled the streets of Kiev on Sunday, first to hear speeches and music and then to fan out and erect barricades in the district where government institutions have their headquarters.</p><p itemprop="articleBody">
    Carrying blue-and-yellow Ukrainian and European Union flags, the teeming crowd filled 
Independence Square, where protests have steadily gained momentum since Mr. Yanukovich refused on Nov. 21 to sign trade and political agreements with the European Union. The square has been transformed by a vast and growing tent encampment, and demonstrators have occupied City Hall and other public buildings nearby. Thousands more people gathered in other cities across the country.        </p><p itemprop="articleBody">
    “Resignation! Resignation!” people in the Kiev crowd chanted on Sunday, demanding that Mr. Yanukovich and the government led by Prime Minister Mykola Azarov leave office.        </p>

我正在尝试使用以下代码解析上面的HTML:

HtmlAgilityPack.HtmlWeb nytArticlePage = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument nytArticleDoc = new HtmlAgilityPack.HtmlDocument();

System.Diagnostics.Debug.WriteLine(articleUrl);
nytArticleDoc = nytArticlePage.Load(articleUrl);
var articleBodyScope = 
        nytArticleDoc.DocumentNode.SelectNodes("//p[@itemprop='articleBody']");

修改

但似乎articleBodyScope是空的,因为:

if (articleBodyScope != null)
{
    System.Diagnostics.Debug.WriteLine("CONTENT NOT NULL");
    foreach (var node in articleBodyScope)
    {
        articleBodyText += node.InnerText;
    }
}

不打印“CONTENT NOT NULL”且articleBodyText仍为空。 如果有人能指出我的解决方案,我将不胜感激,提前谢谢!

1 个答案:

答案 0 :(得分:0)

纽约时报似乎确实发现你不接受他们的cookies。因此,他们会向您显示cookie警告和登录框。通过实际提供CookieContainer,您可以让.NET处理整个cookie业务,NYT实际上会向您展示其内容:

using System;
using Microsoft.VisualStudio.TestTools.UnitTesting;

namespace UnitTestProject3
{
    using System.Net;
    using System.Runtime;

    using HtmlAgilityPack;

    [TestClass]
    public class UnitTest1
    {
        [TestMethod]
        public void WhenProvidingCookiesYouSeeContent()
        {
            HtmlDocument doc = new HtmlDocument();

            WebClient wc = new WebClientEx(new CookieContainer());

            string contents = wc.DownloadString(
                "http://www.nytimes.com/2013/12/10/world/asia/thailand-protests.html?partner=rss&emc=rss&_r=1&");
            doc.LoadHtml(contents);

            var nodes = doc.DocumentNode.SelectNodes(@"//p[@itemprop='articleBody']");

            Assert.IsNotNull(nodes);
            Assert.IsTrue(nodes.Count > 0);
        }
    }

    public class WebClientEx : WebClient
    {
        public WebClientEx(CookieContainer container)
        {
            this.container = container;
        }

        private readonly CookieContainer container = new CookieContainer();

        protected override WebRequest GetWebRequest(Uri address)
        {
            WebRequest r = base.GetWebRequest(address);
            var request = r as HttpWebRequest;
            if (request != null)
            {
                request.CookieContainer = container;
            }
            return r;
        }

        protected override WebResponse GetWebResponse(WebRequest request, IAsyncResult result)
        {
            WebResponse response = base.GetWebResponse(request, result);
            ReadCookies(response);
            return response;
        }

        protected override WebResponse GetWebResponse(WebRequest request)
        {
            WebResponse response = base.GetWebResponse(request);
            ReadCookies(response);
            return response;
        }

        private void ReadCookies(WebResponse r)
        {
            var response = r as HttpWebResponse;
            if (response != null)
            {
                CookieCollection cookies = response.Cookies;
                container.Add(cookies);
            }
        }
    }
}

感谢this answer for the extended WebClient class

注意

可能违反纽约时报的使用条款,公然剥夺他们网站上的新故事。