I get "InvalidOperationException: Sequence contains no matching element" with the following code:
private void buttonLoadHTML_Click(object sender, EventArgs e)
{
GetParagraphsListFromHtml(@"C:\PlatypiRUs\fitt.html");
}
// This code adapted from Kirk Woll's answer at
// http://stackoverflow.com/questions/4752840/html-agility-pack-c-sharp-paragraph-parsing-problem
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
var pars = new List<string>();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);
foreach (var par in doc.DocumentNode
.DescendantNodes()
.Single(x => x.Id == "body")
.DescendantNodes()
.Where(x => x.Name == "p"))
//.Where(x => x.Name == "h1" || x.Name == "h2" || x.Name == "h3" || x.Name == "hp" || ))
// ^ This is what I'd really like to do, but I don't know if this is possible
// or, if it is, if the syntax is correct
{
pars.Add(par.InnerText);
}
// test
foreach (string s in pars)
{
MessageBox.Show(s);
}
return pars;
}
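Why doesn't the code find the paragraphs?
What I really want is to find all the text (h1 through h3 or higher, too), but this is a start.
BTW: the HTML file I'm testing with does contain some paragraph elements.
In response to Amy's implied request, and for the sake of full disclosure, here is the entire test HTML file: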
<style>
body {
background-color: orange;
font-family: Verdana, sans-serif;
}
h1 {
color: Blue;
font-family: 'Segoe UI', Verdana, sans-serif;
}
h2 {
color: white;
font-family: 'Palatino Linotype', 'Palatino', sans-serif;
}
h3 {
display: inline-block;
}
</style>
<h1>Found in the Translation</h1>
<h2>Bilingual Editions of Classic Literature</h2>
<div><label>Contact: </label><a href="mailto:axx3andspace@gmail.com">Found in the Translation</a></div>
<h2><cite>Around the World in 80 Days</cite> by Jules Verne (French & English Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495308081" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I0DOYRE" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51BCZUX2-dL._SL160_.jpg" /></a>
<h2><cite>Gulliver's Travels</cite> by Jonathan Swift (English & French Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495374688" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I5319ZO" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/517O76OyaWL._SL160_.jpg" /></a>
<h2><cite>Journey to the Center of the Earth</cite> by Jules Verne (French & English Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495409031" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41hosXOIw8L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I6LG25M" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/41qj8DfrihL._SL160_.jpg" /></a>
<h2><cite>Treasure Island</cite> by Robert Louis Stevenson (English & Finnish Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495418936" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51veMV3OiOL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00IA5V4KC" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51XNUWbA07L._SL160_.jpg" /></a>
<h2><cite>Robinson Crusoe</cite> by Daniel Defoe (English & French Side by Side)</h2>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1495448053" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51QQMRPrP9L._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00I9IE8OY" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5128hqiw3DL._SL160_.jpg" /></a>
<h2><cite>Don Quixote</cite> by Miguel de Cervantes Saavedra (Spanish & English Side by Side)</h2>
<h3>Paperback</h3></br>
<h3>Volume I</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/149474967X" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1494803445" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/1494841983" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>
<h3>Kindle</h3></br>
<h3>Volume I</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00HQMWPQ2" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51HqjOPXLVL._SL160_.jpg" /></a>
<h3>Volume II</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00HYN2QGM" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51NONygEMYL._SL160_.jpg" /></a>
<h3>Volume III</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00HLX519E" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51G%2BW3ICHkL._SL160_.jpg" /></a></br>
<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English & German Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/193659420X" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00ESLTIYQ" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>
<h2><cite>Alice's Adventures in Wonderland</cite> by Lewis Carroll (English & Italian Side by Side)</h2>
<h3>Coming soon; for now, see:</h3></br/>
<h3>Paperback</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/193659420X" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/5143vIpQ2YL._SL160_.jpg" /></a>
<h3>Kindle</h3>
<a href="https://rads.stackoverflow.com/amzn/click/com/B00ESLTIYQ" rel="nofollow noreferrer" target="_blank"><img height="160" width="107" src="http://ecx.images-amazon.com/images/I/51%2BX0Dy7uNL._SL160_.jpg" /></a>
<h2>Other Sites:</h2>
<p><a href="http://usamaporama.azurewebsites.net/" target="_blank">USA Map-O-Rama</a></p>
<p><a href="http://www.awardwinnersonly.com/" target="_blank">Award-winning Movies, Books, and Music</a></p>
<p><a href="http://www.bigsurgarrapata.com/" target="_blank">Garrapata State Park in Big Sur Throughout the Seasons</a></p>
This works (although it reads a "live" web page rather than an HTML file saved to disk):
public List<string> GetParagraphsListFromHtml(string sourceHtml)
{
var pars = new List<string>();
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(sourceHtml);
var getHtmlWeb = new HtmlWeb();
var document = getHtmlWeb.Load("http://www.montereycountyweekly.com/opinion/letters/article_e333a222-942d-11e3-ba9c-001a4bcf6878.html");
//http://www.bigsurgarrapata.com/ only returned one paragraph
// http://usamaporama.azurewebsites.net/ <-- none
// http://www.awardwinnersonly.com/ <- same as bigsurgarrapata
var pTags = document.DocumentNode.SelectNodes("//p");
int counter = 1;
if (pTags != null)
{
foreach (var pTag in pTags)
{
pars.Add(pTag.InnerText);
MessageBox.Show(pTag.InnerText);
counter++;
}
}
MessageBox.Show("done!");
return pars;
}
Answer 0 (score: 0)
It turns out this is quite simple. This isn't complete but, inspired by this answer, it is enough to get started:
HtmlAgilityPack.HtmlDocument htmlDoc = new HtmlAgilityPack.HtmlDocument();
// There are various options, set as needed
htmlDoc.OptionFixNestedTags = true;
htmlDoc.Load(@"C:\Platypus\dplatypus.htm");
if (htmlDoc.DocumentNode != null)
{
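// SelectNodes returns null, not an empty collection, when nothing matches
IEnumerable<HtmlAgilityPack.HtmlNode> textNodes = htmlDoc.DocumentNode.SelectNodes("//text()")
?? Enumerable.Empty<HtmlAgilityPack.HtmlNode>(); // needs using System.Linq
foreach (HtmlNode node in textNodes)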
{
if (!string.IsNullOrWhiteSpace(node.InnerText))
{
MessageBox.Show(node.InnerText);
}
}
}
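As for why the original code threw: Single() raises InvalidOperationException whenever no element satisfies the predicate, and nothing in the parsed document has an id attribute equal to "body". On top of that, LoadHtml expects the markup itself, so passing it a file path parses the path string as if it were HTML; Load is the method that reads from disk. A minimal sketch of one way to collect just the h1, h2, h3 and p text follows. The class and method names (TextExtractor, FromFile, FromHtml) are made up for illustration, and the XPath union is my own choice rather than anything taken from the linked answers:

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper, not part of Html Agility Pack.
// In a WinForms file you may need to write HtmlAgilityPack.HtmlDocument in full,
// because System.Windows.Forms also defines an HtmlDocument type.
public static class TextExtractor
{
    // XPath union: every h1, h2, h3 and p element anywhere in the document.
    private const string HeadingsAndParagraphs = "//h1 | //h2 | //h3 | //p";

    // Load takes a file path and reads the markup from disk.
    public static List<string> FromFile(string path)
    {
        var doc = new HtmlDocument();
        doc.Load(path);
        return Extract(doc);
    }

    // LoadHtml takes the markup itself, not a path.
    public static List<string> FromHtml(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return Extract(doc);
    }

    private static List<string> Extract(HtmlDocument doc)
    {
        var result = new List<string>();

        // SelectNodes returns null when nothing matches, so guard before iterating.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(HeadingsAndParagraphs);
        if (nodes == null)
        {
            return result;
        }

        foreach (HtmlNode node in nodes)
        {
            // InnerText keeps entities such as &amp; as written; DeEntitize decodes them.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
            {
                result.Add(text);
            }
        }
        return result;
    }
}
```

With that in place, the button handler from the question could call TextExtractor.FromFile(@"C:\PlatypiRUs\fitt.html") and show each returned string. Selecting elements by name, rather than every text node, also leaves out text that sits outside headings and paragraphs.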