Why is this HtmlAgilityPack operation invalid when matching elements do exist?

Time: 2014-02-14 19:37:04

Tags: c# screen-scraping html-agility-pack

I get "InvalidOperationException, Message = Sequence contains no matching element" with the following code:

private void buttonLoadHTML_Click(object sender, EventArgs e)
{
    GetParagraphsListFromHtml(@"C:\PlatypiRUs\fitt.html");
}

// This code adapted from Kirk Woll's answer at
// http://stackoverflow.com/questions/4752840/html-agility-pack-c-sharp-paragraph-parsing-problem
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sourceHtml);
    foreach (var par in doc.DocumentNode
        .DescendantNodes()
        .Single(x => x.Id == "body")
        .DescendantNodes()
        .Where(x => x.Name == "p"))
        // .Where(x => x.Name == "h1" || x.Name == "h2" || x.Name == "h3" || x.Name == "hp" || ))
        // ^-- This is what I'd really like to do, but I don't know whether it is possible or,
        //     if it is, whether the syntax is correct
    {
        pars.Add(par.InnerText);
    }
    // test
    foreach (string s in pars)
    {
        MessageBox.Show(s);
    }
    return pars;
}

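Why doesn't the code find the paragraphs?

What I really want is to find all the text (h1..3 or higher), but this is a start.

BTW: The HTML file I'm testing with does contain some paragraph elements.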
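
For reference, the message quoted above is the one LINQ's Single raises when no element satisfies its predicate. In HtmlAgilityPack, x.Id is the value of a node's id attribute, while x.Name is the tag name, so Single(x => x.Id == "body") throws unless exactly one node carries id="body". The sketch below reproduces that behaviour with plain strings; the ids list is made up for illustration and stands in for parsed nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical element ids standing in for parsed nodes; none of them is "body".
var ids = new List<string> { "", "header", "footer" };

string outcome;
try
{
    // Single throws InvalidOperationException when no element matches the predicate.
    ids.Single(id => id == "body");
    outcome = "found";
}
catch (InvalidOperationException)
{
    outcome = "threw";
}

// SingleOrDefault returns null (for reference types) instead of throwing when nothing matches.
var match = ids.SingleOrDefault(id => id == "body");

Console.WriteLine(outcome);
Console.WriteLine(match == null ? "no match" : match);
```

Switching to SingleOrDefault (or FirstOrDefault) plus a null check is one way to make the "nothing matched" case explicit rather than exceptional.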

UPDATE

In response to Amy's implied request, and in the interest of full disclosure/ultimate illumination, here is the entire test HTML file:

<style>
body {
    background-color: orange;
    font-family: Verdana, sans-serif;
}

h1 {
    color: Blue;   
    font-family: 'Segoe UI', Verdana, sans-serif;
}

h2 {
    color: white;    
    font-family: 'Palatino Linotype', 'Palatino', sans-serif;
}

h3 {
    display: inline-block;
}
</style>

<h1>Found in the Translation</h1>
<h2>Bilingual Editions of Classic Literature</h2>
<div><label>Contact: </label><a href="mailto:axx3andspace@gmail.com">Found in the Translation</a></div>

<h2><cite>Around the World in 80 Days</cite> by Jules Verne (French &amp; English Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495308081" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I0DOYRE" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>

<h2><cite>Gulliver's Travels</cite> by Jonathan Swift (English &amp; French Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495374688" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I5319ZO" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>

<h2><cite>Journey to the Center of the Earth</cite> by Jules Verne (French &amp; English Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495409031" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41hosXOIw8L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I6LG25M" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41qj8DfrihL._SL160_.jpg" /></a>

<h2><cite>Treasure Island</cite> by Robert Louis Stevenson (English &amp; Finnish Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495418936" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51veMV3OiOL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00IA5V4KC" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51XNUWbA07L._SL160_.jpg" /></a>

<h2><cite>Robinson Crusoe</cite> by Daniel Defoe (English &amp; French Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495448053" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51QQMRPrP9L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I9IE8OY" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5128hqiw3DL._SL160_.jpg" /></a>

<h2><cite>Don Quixote</cite> by Miguel de Cervantes Saavedra (Spanish &amp; English Side by Side)</h2>
<h3>Paperback</h3></br>
<h3>Volume I</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/149474967X" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1494803445" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1494841983" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>
<h3>Kindle</h3></br>
<h3>Volume I</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00HQMWPQ2" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00HYN2QGM" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00HLX519E" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>

<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English &amp; German Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/193659420X" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00ESLTIYQ" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>

<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English &amp; Italian Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/193659420X" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00ESLTIYQ" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>

<h2>Other Sites:</h2>
<p><a href="http://usamaporama.azurewebsites.net/"  target="_blank">USA Map-O-Rama</a></p>
<p><a href="http://www.awardwinnersonly.com/"  target="_blank">Award-winning Movies, Books, and Music</a></p>
<p><a href="http://www.bigsurgarrapata.com/"  target="_blank">Garrapata State Park in Big Sur Throughout the Seasons</a></p>

UPDATE 2

This works (although against a "live" web page rather than an HTML file saved to disk):

public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
    var pars = new List<string>();
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(sourceHtml);

    var getHtmlWeb = new HtmlWeb();
    var document = getHtmlWeb.Load("http://www.montereycountyweekly.com/opinion/letters/article_e333a222-942d-11e3-ba9c-001a4bcf6878.html"); 
    //http://www.bigsurgarrapata.com/ only returned one paragraph
    // http://usamaporama.azurewebsites.net/ <-- none
    // http://www.awardwinnersonly.com/ <- same as bigsurgarrapata
    var pTags = document.DocumentNode.SelectNodes("//p");
    int counter = 1;
    if (pTags != null)
    {
        foreach (var pTag in pTags)
        {
            pars.Add(pTag.InnerText);
            MessageBox.Show(pTag.InnerText);
            counter++;
        }
    }
    MessageBox.Show("done!");
    return pars;
}

1 Answer:

Answer 0: (score: 0)

It turned out to be quite simple; this is not complete, but, inspired by this answer, it's enough to get started:

HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();

// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;

htmlDoc.Load(@"C:\Platypus\dplatypus.htm");

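// SelectNodes returns null when nothing matches, so check the result before iterating
HtmlAgilityPack.HtmlNodeCollection textNodes = htmlDoc.DocumentNode.SelectNodes("//text()");
if (textNodes != null)
{
    foreach (HtmlAgilityPack.HtmlNode node in textNodes)
    {
        // Skip nodes that contain only whitespace (line breaks between tags, etc.)
        if (!string.IsNullOrWhiteSpace(node.InnerText))
        {
            MessageBox.Show(node.InnerText);
        }
    }
}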
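
If only the headings and paragraphs are wanted, as the question originally asked, an XPath union may be enough. The following is a rough sketch rather than a definitive solution: the sample markup and the GetTextBlocks helper are made up for illustration, and it assumes the HtmlAgilityPack package is referenced. It parses an inline string with LoadHtml so that it runs without a file; for a file on disk, Load(path) is the call to use, since LoadHtml treats its argument as markup and would parse the path text itself.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical inline sample; for a file on disk, call doc.Load(path) instead.
const string sampleHtml =
    "<h1>Found in the Translation</h1>" +
    "<h2>Other Sites:</h2>" +
    "<p><a href=\"http://usamaporama.azurewebsites.net/\">USA Map-O-Rama</a></p>";

var doc = new HtmlDocument();
doc.LoadHtml(sampleHtml);

List<string> blocks = GetTextBlocks(doc);
foreach (string block in blocks)
{
    Console.WriteLine(block);
}

// Collects the non-empty text of every h1, h2, h3 and p element.
static List<string> GetTextBlocks(HtmlDocument document)
{
    var result = new List<string>();

    // The XPath union operator "|" selects several element names in one query.
    HtmlNodeCollection nodes = document.DocumentNode.SelectNodes("//h1 | //h2 | //h3 | //p");
    if (nodes == null)
    {
        return result; // SelectNodes returns null when nothing matches
    }

    foreach (HtmlNode node in nodes)
    {
        // DeEntitize turns entities such as &amp; back into plain characters.
        string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
        if (text.Length > 0)
        {
            result.Add(text);
        }
    }
    return result;
}
```

Unlike "//text()", this query skips the contents of the style element, which would otherwise show up as text.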